Hi! I was happily using BeautifulSoup and I'm also using a text file to supply input parameters to my Python script.

I then came across the famous "UnicodeEncodeError" error.

I've been reading questions here at SO but I'm still confused.

What does ASCII have to do with all of this? What encoding should I use in my text editor (Notepad++)? ANSI? UTF-8? Decoding a string to ASCII doesn't always seem to work (I'm guessing the string coming from BeautifulSoup is in a different encoding). How do I fix this?

Anyway any help and clarifications will be greatly appreciated.

Thanks!

edit: BeautifulSoup's docs say that it only uses unicode, but I'm still getting Unicode errors :(

  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u300d' in position 3: character maps to <undefined>
+1  A: 

ANSI is not a character encoding (in common parlance it refers to certain escape sequences, though it's of course also the acronym for the American National Standards Institute). You can set the encoding in Notepad++ (and check what encoding you're using) -- hopefully utf-8, because that's a universal encoding (it lets you represent any Unicode code point). You build unicode from your utf-8 encoded text with an explicit decode method call, or you read the file as unicode with codecs.open (both require you to specify your encoding name -- again, hopefully 'utf8').
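
A minimal sketch of the two approaches just mentioned, assuming the input file really was saved as UTF-8. The file name params.txt and its contents are hypothetical, written to a temporary directory only so the example is self-contained:

```python
import codecs
import os
import tempfile

# Hypothetical input file, created here only to make the sketch runnable.
path = os.path.join(tempfile.mkdtemp(), "params.txt")
with open(path, "wb") as f:
    f.write(u"caf\u00e9 \u300d\n".encode("utf-8"))

# Option 1: read the raw bytes, then decode them explicitly.
with open(path, "rb") as f:
    raw = f.read()
text_from_decode = raw.decode("utf-8")

# Option 2: let codecs.open do the decoding while reading.
with codecs.open(path, "r", encoding="utf-8") as f:
    text_from_codecs = f.read()
```

Both variables should end up holding the same unicode text. If the file was saved in some other encoding, the name passed to decode or codecs.open has to change to match.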

Alex Martelli
I'm confused because I see "Encode in ANSI" option with Notepad++. What about the strings coming from BeautifulSoup scraped from HTML pages? They might not always be utf-8. BTW Alex, what editor do you usually use?
@grokker: There are really two meanings to "ANSI" in the context of text. One is a set of escape sequences used for terminal control. In the DOS/Windows world, there is also a character set which, against all reason, is referred to as "ANSI". Today it's actually Windows-1252 (http://en.wikipedia.org/wiki/Windows-1252), which is almost-but-not-quite the ISO Latin-1 set (though it hasn't always been such). That's probably what Notepad++ is referring to.
Nicholas Knight
@Nicholas: "ANSI" character set is standard Windows (not DOS) jargon for cp125x for x in range(9) and varies with the locale; it's not just cp1252. What does "though it hasn't always been such" mean?
John Machin
@grokker, I almost always use vim (since I learned vi in the '70s, my fingers' "muscle memory" has nullified all the horror cries of ergonomics experts against its modal nature;-), often in the gvim GUI version (real vi purists use vim in a terminal, always;-). People either love (a few of us) or hate (more people) vi...!-)
Alex Martelli
Thanks Alex. I actually use vim whenever I can especially when editing files via ssh. Although I find IDEs very convenient for most purposes, vim will always have a special place :)
+1  A: 

What does ASCII have to do with all of this?

Python has no way to find out what encoding was used to store text, so it assumes ASCII by default. However, ASCII defines only the first 128 characters, so any byte outside that range results in a decode error (which is actually a good thing, since it stops you from passing incorrectly decoded strings around).
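
To make that concrete, here is a small sketch. It names the 'ascii' codec explicitly to stand in for the implicit default, and the sample text is made up for illustration:

```python
# UTF-8 bytes as they might sit in a file; the accented letter is not ASCII.
raw = u"caf\u00e9".encode("utf-8")

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Raised because the bytes for the accented letter are above 127.
    ascii_ok = False

# Decoding with the encoding that was actually used works.
text = raw.decode("utf-8")
```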

Most of the time your string will be in utf-8, since it's the most common way to encode Unicode, so it's usually safe to call s.decode('utf-8') on str type strings (or use a unicode(s, 'utf-8') call).
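
A rough sketch of the full round trip: decode on input, then encode explicitly on output. The cp437 target and the u'\u300d' character are taken from the traceback in the question; using the 'replace' error handler is an assumption about what is acceptable for your output, not the only option:

```python
raw = b"\xe3\x80\x8d"        # UTF-8 bytes for u'\u300d'
text = raw.decode("utf-8")   # now a unicode string

try:
    text.encode("cp437")
    encodable = True
except UnicodeEncodeError:
    # cp437 has no mapping for this character, as in the question's traceback.
    encodable = False

# Substitute '?' for anything the target codec cannot represent.
safe = text.encode("cp437", "replace")
```

The point is that the error in the question comes from the encode step (writing to a cp437 console), not from decoding the input.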

If you don't know in advance what encoding the text has, and it provides no encoding metadata, you can try the chardet module.
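
Something along these lines might work as a starting point. chardet is a third-party package that may not be installed, so the import is guarded; decode_best_effort and its fallback parameter are hypothetical names made up for this sketch, and the detected encoding is only a guess:

```python
try:
    import chardet  # third-party; may not be installed
except ImportError:
    chardet = None


def decode_best_effort(raw, fallback="utf-8"):
    """Decode bytes using chardet's guess, or `fallback` if there is none."""
    guess = chardet.detect(raw)["encoding"] if chardet else None
    try:
        return raw.decode(guess or fallback, "replace")
    except LookupError:
        # chardet named an encoding this Python does not know about.
        return raw.decode(fallback, "replace")
```

For example, decode_best_effort(open(path, 'rb').read()) would give you unicode text for a file at some path of yours, though short inputs can easily be misdetected.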

BeautifulSoup can output its result in different encodings and forms, so you just need to specify that you want unicode there.

Daniel Kluev