I have a strange problem with lxml in the deployed version of my Django application. I use lxml to parse an HTML page that I fetch from my server. This works perfectly well on the development server on my own computer, but on the production server it raises a UnicodeDecodeError:

('utf8', "\x85why hello there!", 0, 1, 'unexpected code byte')

I have made sure that Apache (with mod_python) runs with LANG='en_US.UTF-8'.

I've tried googling for this problem and tried different approaches to decoding the string correctly, but I can't figure it out.

In your answer, you may assume that my string is called hello or something.

A: 

I suppose you have set Python's default encoding to utf-8 in site.py?

fredz
That is never the right answer. It can be done for convenience, but masks other, real problems.
Jeremy Dunck
A: 

Since modifying site.py is not an ideal solution, try this at the start of your program:

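import sys
# Python 2 only: site.py deletes sys.setdefaultencoding at startup,
# so reload(sys) is needed to bring it back before calling it.
reload(sys)
sys.setdefaultencoding("utf-8")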
cartman
Interesting! You don't happen to know what the start of my program is in Django?
Deniz Dogan
Calling this code before calling any lxml related code should fix the issue.
cartman
Unfortunately, it doesn't...
Deniz Dogan
Why is reload called right after import? Should it be called after sys.setdefaultencoding?
David Berger
I actually tried both ways with no success.
Deniz Dogan
The reason the reload(sys) is needed is that Python's site module specifically removes setdefaultencoding from sys. This is because client code using setdefaultencoding is like a monkey with a gun. Please read http://www.joelonsoftware.com/articles/Unicode.html and http://groups.google.com/group/comp.lang.python/browse_thread/thread/7c6ceea571de69b1/92d3d2796bd46949 Just don't do it.
Jeremy Dunck
+3  A: 

"\x85why hello there!" is not a utf-8 encoded string. You should try decoding the webpage before passing it to lxml. Check which encoding it uses by looking at the HTTP headers when you fetch the page; maybe you will find the problem there.
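
A minimal sketch of that idea, not a definitive fix: decode the fetched bytes to text first, trying the charset the server declared before anything else. The helper name decode_page and its declared_charset parameter are made up for illustration, and the windows-1252 fallback is only an assumption about where a stray \x85 byte might come from.

```python
# Bytes as fetched from the server; b"\x85" is not a valid UTF-8 sequence.
raw = b"\x85why hello there!"


def decode_page(data, declared_charset=None):
    """Decode fetched bytes to unicode text before handing them to a parser.

    declared_charset is whatever the Content-Type header reported
    (hypothetical parameter). windows-1252 is only an assumed fallback.
    """
    for charset in (declared_charset, "utf-8", "windows-1252"):
        if not charset:
            continue
        try:
            return data.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Last resort: replace undecodable bytes with U+FFFD instead of failing.
    return data.decode("utf-8", "replace")


text = decode_page(raw)
```

The resulting text is a unicode string, which is what you would then pass to lxml (for example to lxml.html.fromstring) instead of the raw bytes.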

THC4k
Agreed. Try "\x85why hello there!".decode("utf-8"). That will change the non-unicode code (\x85) to a unicode code. You may also need to add: "# -*- coding: utf-8 -*-" (without quotes) to the top of your .py file.
landyman
I concur that this is probably caused by you returning a byte string which is not, in fact, utf-8. In general, if you're claiming to create a UTF-8 string, you should internally be using the unicode type and, just before putting bits on the wire, encode to utf-8. If you munge byte strings together, you're almost certainly going to screw up somewhere, and then discover it somewhere much later. FWIW, "\x85" is "…" (horizontal ellipsis) in Windows-1252.
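
A rough sketch of the pattern described in this comment: decode on the way in, work with unicode text in the middle, and encode only on the way out. The variable names and the latin-1 input are assumptions chosen for illustration.

```python
# Input boundary: bytes arrive in some known encoding (latin-1 assumed here).
incoming = b"caf\xe9"
text = incoming.decode("latin-1")

# Inside the program: only unicode text is combined, never raw byte strings.
greeting = text + u" \u2026 why hello there!"

# Output boundary: encode to UTF-8 just before putting bits on the wire.
outgoing = greeting.encode("utf-8")
```

Mixing incoming with a UTF-8 byte string directly would have produced bytes in two different encodings, which is the kind of mistake that only shows up much later.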
Jeremy Dunck
A: 

Doesn't syntax such as u"\x85why hello there!" help?

You may find the following resources from the official Python documentation helpful:

atc