ansaurus

Question

Encoding in python with lxml - complex solution

Answer 1

+2 A:

lxml can be a little wonky about input encodings. It is best to just sent UTF8 in and get UTF8 out. You might want to use the chardet module or Unicode Dammit to decode the actual data. Vaguely like:

import chardet
from lxml import html
content = urllib2.urlopen(url).read()
encoding = chardet.detect(content)['encoding']
if encoding != 'utf-8':
    content = content.decode(encoding, 'replace').encode('utf-8')
doc = html.fromstring(content, base_url=url)

I'm not sure why you are moving between lxml and etree, unless you are interacting with another library that already uses etree?

Ian Bicking 2010-04-22 06:21:11

Unicode Dammit seems good. And about etree you are right, I've remove it from code.

Vojtech R. 2010-04-23 14:50:14

ansaurus

tags:

views:

answers:

Encoding in python with lxml - complex solution

related questions