How to handle Unicode (non-ASCII) characters in Python?

+1 Q:

I'm programming in Python and fetching information from a web page with the urllib2 library. The problem is that the page can contain non-ASCII characters such as 'ñ' and 'á'. As soon as urllib2 encounters one of these characters, it raises an exception like this:

    (more stack trace)
      File "c:\Python25\lib\httplib.py", line 711, in send
        self.sock.sendall(str)
      File "<string>", line 1, in sendall
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 74: ordinal not in range(128)

I need to handle those characters. That is, I don't want to just catch the exception; I want the program to keep working with the data. Is there a way (this may be a naive question) to use a codec other than ASCII? I have to work with those characters, insert them into a database, and so on.

+6 A:

You want to use Unicode for all your work if you can.

You will probably find this question and its answers useful:

http://stackoverflow.com/questions/1020892/python-urllib2-read-to-unicode
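One way to read "use Unicode for all your work" is to keep text as Unicode inside the program and convert to bytes only where data leaves it. The traceback in the question is raised in httplib's send, which suggests the outgoing request itself contained 'ñ'. The sketch below percent-encodes a path as UTF-8 before it would be sent. It is written for Python 3 (urllib.parse.quote; Python 2 has urllib.quote, which expects an already-encoded byte string), and the helper name to_ascii_path and the example path are made up for illustration.

```python
from urllib.parse import quote  # Python 3 location of quote()


def to_ascii_path(path):
    """Percent-encode a text path as UTF-8 so only ASCII goes on the wire.

    Hypothetical helper, not part of urllib2 or httplib.
    """
    # safe="/" keeps path separators readable; everything non-ASCII is escaped.
    return quote(path, safe="/")


# Hypothetical path containing the character from the traceback (u'\xf1').
ascii_path = to_ascii_path(u"/buscar/a\u00f1o")
print(ascii_path)
```

Whether UTF-8 is the right encoding for the escaped bytes depends on what the server expects, so treat that choice as an assumption.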

Paul McMillan 2009-10-29 15:45:13

You might want to look into using an actual parsing library to find this information. lxml, for instance, already handles Unicode encoding and decoding using the declared character set.
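As a rough illustration of "using the declared character set" without a parsing library, the sketch below pulls the charset out of a Content-Type header value and uses it to decode the body. This is not how lxml does it; it is a standard-library approximation for Python 3. The function names, the sample header, and the UTF-8 fallback are assumptions for the example.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header, or a fallback."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Hypothetical response: Latin-1 bytes with a matching header.
text = decode_body(b"ma\xf1ana", "text/html; charset=ISO-8859-1")
print(text)
```

A real page may also declare its encoding in a meta tag, or declare it wrongly, which is where a parsing library earns its keep.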

Hank Gay 2009-10-29 16:08:22

+1 A:

What you read from the socket is just a sequence of bytes. If you want a string, you have to decode it:

yourstring = receivedbytes.decode("utf-8")

(substituting whatever encoding you're using for "utf-8")

Then you have to do the reverse to send it back out:

outbytes = yourstring.encode("utf-8")
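Put together, the round trip might look like the sketch below. It assumes Python 3 and assumes the data really is UTF-8; the sample bytes and the helper decode_lenient are invented for the example. The helper shows one option when the encoding guess may be wrong: replace undecodable bytes instead of raising.

```python
# Bytes as they might arrive from the socket: UTF-8 for "mañana".
received_bytes = b"ma\xc3\xb1ana"

# Decode once on the way in; from here on, work with text.
your_string = received_bytes.decode("utf-8")

# Encode once on the way out.
out_bytes = your_string.encode("utf-8")


def decode_lenient(data, encoding="utf-8"):
    """Decode without raising; undecodable bytes become U+FFFD.

    Hypothetical helper for data whose encoding is uncertain.
    """
    return data.decode(encoding, errors="replace")


# Latin-1 bytes fed to a UTF-8 decoder: no exception, but the 'ñ' is lost.
damaged = decode_lenient(b"ma\xf1ana")
```

Replacing characters keeps the program running but loses data, so it is a fallback rather than a substitute for knowing the actual encoding.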

dsimard 2009-10-29 16:58:42
