I'm programming in Python and fetching information from a web page with the urllib2 library. The problem is that the page can contain non-ASCII characters such as 'ñ', 'á', etc. As soon as urllib2 encounters one of these characters, it raises an exception like this:

  • (more stack trace)

  • File "c:\Python25\lib\httplib.py", line 711, in send self.sock.sendall(str)

  • File "", line 1, in sendall: UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 74: ordinal not in range(128)

I need to handle those characters. I don't mean catching the exception; I want the program to keep working with the data. Is there any way to, for example (I don't know if this is a silly idea), use a codec other than ASCII? I have to work with those characters, insert them into a database, and so on.

+6  A: 

You want to use Unicode strings for all your work if you can.

You probably will find this question/answer useful:

http://stackoverflow.com/questions/1020892/python-urllib2-read-to-unicode

Paul McMillan
A: 

You might want to look into using an actual parsing library to find this information. lxml, for instance, already addresses Unicode encode/decode using the declared character set.

Hank Gay
+1  A: 

You just read a set of bytes from the socket. If you want a string you have to decode it:

yourstring = receivedbytes.decode("utf-8")

(substituting whatever encoding you're using for "utf-8")

Then you have to do the reverse to send it back out:

outbytes = yourstring.encode("utf-8")
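A minimal sketch of that round trip, assuming the page really is UTF-8 encoded. The helper names `to_text` and `to_bytes` are hypothetical, and the literal bytes stand in for whatever a call such as `response.read()` would return. The `b"..."` literal needs Python 2.6 or later.

```python
# Hypothetical helpers; the names and the UTF-8 default are assumptions.

def to_text(received_bytes, encoding="utf-8"):
    """Decode raw bytes read from the network into a Unicode string."""
    return received_bytes.decode(encoding)


def to_bytes(text, encoding="utf-8"):
    """Encode a Unicode string into bytes before sending or storing it."""
    return text.encode(encoding)


# Stand-in for the bytes read from the socket: "mañana" encoded as UTF-8.
raw = b"ma\xc3\xb1ana"

text = to_text(raw)         # Unicode string containing the character ñ
out_bytes = to_bytes(text)  # encoded again, ready to be sent back out
```

Decoding once on input and encoding once on output keeps everything in between (string handling, database inserts) working on Unicode strings, so the implicit 'ascii' codec is never involved.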

dsimard