I used lxml to parse some web page as below:

>>> doc = lxml.html.fromstring(htmldata)
>>> element = doc.cssselect(sometag)[0]
>>> text = element.text_content()
>>> print text
u'Waldenstr\xf6m'

Why does it print u'Waldenstr\xf6m' rather than "Waldenström" here?

After that, I tried to add this text to a MySQL table with the UTF-8 character set and utf8_general_ci collation. Users is a Django model:

>>> Users.objects.create(last_name=text)
'ascii' codec can't encode character u'\xf6' in position 9: ordinal not in range(128)

What was I doing wrong here? How can I get the correct data "Waldenström" and write it to the database?

+2  A: 

you want text.encode('utf8')
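
A minimal sketch of what that call does, written so that it also runs on Python 3. The variable names `ascii_error` and `utf8_bytes` are illustrative and not from the question. Encoding with the ASCII codec fails at the ö, as in the error above, while UTF-8 produces a plain byte string:

```python
text = u'Waldenstr\xf6m'  # the unicode value returned by text_content()

# Encoding with the ASCII codec reproduces the error from the question.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding as UTF-8 succeeds; the o-umlaut becomes the two bytes 0xC3 0xB6.
utf8_bytes = text.encode('utf8')

print(ascii_error)
print(repr(utf8_bytes))
```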

Art Gillespie
yes, i tried this but it also gave the same error.
jack
ok, it works now. thanks art.
jack
A: 
>>> print text
u'Waldenstr\xf6m'

There is a difference between displaying something in the shell (which uses the repr) and printing it (which just spits out the string):

>>> u'Waldenstr\xf6m'
u'Waldenstr\xf6m'

>>> print u'Waldenstr\xf6m'
Waldenström

So, I'm not sure your snippet above is really what happened. If it definitely is, then your XHTML must contain exactly that string:

<div class="something">u'Waldenstr\xf6m'</div>

(maybe it was incorrectly generated by Python using a string's repr() instead of its str()?)

If this is right and intentional, you would need to parse that Python string literal into a simple string. One way of doing that would be:

>>> r= r"u'Waldenstr\xf6m'"
>>> print r[2:-1].decode('unicode-escape')
Waldenström
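
If the literal really is embedded in the page, a stricter alternative to slicing plus unicode-escape might be ast.literal_eval, which parses the whole literal, including the u prefix and the quotes. This is only a sketch, assuming Python 2 or Python 3.3+ (where the u prefix is accepted); `raw` and `parsed` are illustrative names:

```python
import ast

raw = r"u'Waldenstr\xf6m'"  # the literal as it would appear in the markup

# literal_eval only accepts Python literals, so it does not execute code.
parsed = ast.literal_eval(raw)

# Compare against the expected unicode string rather than printing it,
# since printing non-ASCII depends on the console.
print(parsed == u'Waldenstr\xf6m')
```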

If the snippet at the top is actually not quite right and you are simply asking why Python's repr escapes all non-ASCII characters, the answer is that printing non-ASCII to the console is unreliable across environments, so the escape is safer. In the examples above you might have received ?s or worse instead of the ö if you were unlucky.
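
What "unlucky" can look like may be sketched by forcing an ASCII-only encoding, which is roughly what happens when the output encoding cannot represent the character. The choice of ASCII here is an assumption for illustration; real consoles vary:

```python
text = u'Waldenstr\xf6m'

# errors='replace' substitutes '?' for anything the target encoding lacks.
lossy = text.encode('ascii', 'replace')

# errors='backslashreplace' keeps the information as an escape instead,
# which is close to what repr does.
escaped = text.encode('ascii', 'backslashreplace')

print(repr(lossy))
print(repr(escaped))
```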

In Python 3 this changes, and the repr keeps printable non-ASCII characters:

>>> 'Waldenstr\xf6m'
'Waldenström'
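
The escaped form is still available in Python 3 through the built-in ascii(). A short sketch (Python 3 only; the names `shown` and `escaped` are illustrative):

```python
text = 'Waldenstr\xf6m'

shown = repr(text)     # keeps the o-umlaut as a real character
escaped = ascii(text)  # escapes it, like repr did in Python 2 (minus the u prefix)

# Only the escaped form is printed, since it is safe on any console.
print(escaped)
```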
bobince