ansaurus

Question

ANSI, ASCII, Unicode and encoding confusion with Python

Answer 1

+1 A:

ANSI is not a character encoding (in common parlance it refers to certain escape sequences, though it's of course also the acronym for the American National Standards Institute). You can set the encoding in Notepad++ (and check what encoding you're using) -- hopefully utf-8, because that's a universal encoding (it lets you represent any Unicode code point). You build unicode from your utf-8 encoded text with an explicit decode method call, or you read the file as unicode with codecs.open (both require you to specify your encoding name -- again, hopefully 'utf8').
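A minimal sketch of both routes, written for Python 3, where the byte string type is `bytes` and the text type is `str` (in the Python 2 of this thread those were `str` and `unicode`). It uses `io.open` in place of `codecs.open`, since `io.open` takes the same `encoding` argument and exists in both Python 2.6+ and Python 3. The file name `sample.txt` and the sample text are made up for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample data: UTF-8 bytes as they would sit on disk.
raw = b"na\xc3\xafve caf\xc3\xa9"

# Route 1: explicit decode of bytes you already hold.
text = raw.decode("utf-8")

# Route 2: let the file object decode while reading.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(raw)
with io.open(path, encoding="utf-8") as f:
    text_from_file = f.read()

# 12 bytes on disk, 10 characters once decoded.
print(len(raw), len(text), text == text_from_file)
```

Either way, the encoding name has to come from you; nothing in the bytes themselves says "this is utf-8".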

Alex Martelli 2010-07-24 06:11:24

I'm confused because I see an "Encode in ANSI" option in Notepad++. What about the strings coming from BeautifulSoup, scraped from HTML pages? They might not always be utf-8. BTW Alex, what editor do you usually use?

2010-07-24 06:13:47

@grokker: There are really two meanings to "ANSI" in the context of text. One is a set of escape sequences used for terminal control. In the DOS/Windows world, there is also a character set which, against all reason, is referred to as "ANSI". Today it's actually Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252), which is almost-but-not-quite the ISO Latin-1 set (though it hasn't always been such). That's probably what Notepad++ is referring to.

Nicholas Knight 2010-07-24 06:19:12

@Nicholas: "ANSI" character set is standard Windows (not DOS) jargon for cp125x for x in range(9) and varies with the locale; it's not just cp1252. What does "though it hasn't always been such" mean?

John Machin 2010-07-24 07:57:42
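To make the two comments above concrete, here is a small Python 3 sketch of how the same byte decodes under different code pages. The byte values are picked for illustration; `cp1252`, `cp1251` and `latin-1` are codec names from Python's standard library.

```python
# Bytes 0x80-0x9F are where Windows-1252 and ISO Latin-1 disagree.
euro_byte = b"\x80"
as_cp1252 = euro_byte.decode("cp1252")   # euro sign, U+20AC
as_latin1 = euro_byte.decode("latin-1")  # C1 control character, U+0080

# The same byte means something else under another "ANSI" code page.
byte = b"\xe9"
western = byte.decode("cp1252")   # Latin small letter e with acute, U+00E9
cyrillic = byte.decode("cp1251")  # Cyrillic small letter short i, U+0439

print(repr(as_cp1252), repr(as_latin1), repr(western), repr(cyrillic))
```

Neither decode raises an error, which is why a wrong guess about the "ANSI" code page tends to show up as garbled text rather than as an exception.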

@grokker, I almost always use vim (since I learned vi in the '70s, my fingers' "muscle memory" has nullified all the horror cries of ergonomics experts against its modal nature;-), often in the gvim GUI version (real vi purists use vim in a terminal, always;-). People either love (a few of us) or hate (more people) vi...!-)

Alex Martelli 2010-07-24 15:13:34

Thanks Alex. I actually use vim whenever I can especially when editing files via ssh. Although I find IDEs very convenient for most purposes, vim will always have a special place :)

2010-07-24 21:26:09

Answer 2

+1 A:

What has ASCII got to do with all of this?

Python has no way to find out what encoding was used to store text, so it assumes ASCII by default. However, ASCII defines only the first 128 characters, so any byte outside that range results in a decode error (which is actually a good thing, since it stops you from passing incorrectly decoded strings around).
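In the Python 2 of this thread, that ASCII default kicks in implicitly when a `str` is mixed with `unicode`. The sketch below makes the same decode explicit so that it also runs on Python 3; the sample bytes are made up for illustration.

```python
# "café" encoded as UTF-8; the last two bytes are >= 0x80.
data = b"caf\xc3\xa9"

try:
    data.decode("ascii")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    bad_offset = exc.start  # index of the first byte ASCII cannot decode

# Pure 7-bit input decodes without trouble.
plain = b"cafe".decode("ascii")

print(failed, bad_offset, plain)
```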

Most of the time your string will be in utf-8, since it's the most common way to encode Unicode, so it's usually safe to call s.decode('utf-8') on str-type strings (or use unicode(s, 'utf-8')).

If you don't know in advance what encoding the text uses, and it comes with no encoding metadata, you can try the chardet module.
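One possible way to wire that up, as a sketch rather than a recommended recipe: try strict utf-8 first, then ask chardet for a guess if it happens to be installed, then fall back to an assumed code page. The helper name `to_text` and the `cp1252` fallback are assumptions made for this example; `chardet.detect` is the library's detection call and returns a dict whose `'encoding'` entry may be `None`.

```python
try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def to_text(data, fallback="cp1252"):
    """Decode bytes of unknown encoding to text (hypothetical helper)."""
    # Strict UTF-8 rarely accepts text that was written in another encoding.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Ask chardet for a guess, if it is installed.
    if chardet is not None:
        guess = chardet.detect(data)["encoding"]
        if guess:
            try:
                return data.decode(guess)
            except (UnicodeDecodeError, LookupError):
                pass
    # Last resort: an assumed fallback, replacing undecodable bytes.
    return data.decode(fallback, errors="replace")


print(to_text(b"caf\xc3\xa9"))
```

A detector can only guess from byte statistics, so its answer on short inputs is best treated as a hint; encoding metadata from the source (an HTTP header, a meta tag) is more trustworthy when it exists.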

BeautifulSoup can output its results in different encodings and forms, so you just need to specify that you want unicode there.

Daniel Kluev 2010-07-24 06:20:08
