I have an XML file that specifies an encoding, and I use UnicodeDammit to convert it to unicode (for reasons of storage, I can't store it as a string). I later pass it to lxml but it refuses to ignore the encoding specified in the file and parse it as Unicode, and it raises an exception.

How can I force lxml to parse the document? This behaviour seems too restrictive.

A: 

Basically, the solution is to do:

if isinstance(mystring, unicode):
    mystring = mystring.encode("utf-8")

Seriously. Good job, lxml.

EDIT: It turns out that, in this instance, lxml autodetects the encoding incorrectly. It appears that I will have to manually search for and remove "charset" and "encoding" from the page.

Stavros Korokithakis
+2  A: 

You cannot parse from unicode strings AND have an encoding declaration in the string. So either you make it an encoded string (since you apparently can't store it as a byte string, you will have to re-encode it before parsing), or you serialize the tree as unicode with lxml yourself: etree.tostring(tree, encoding=unicode), WITHOUT an XML declaration. You can easily parse the result again with etree.fromstring, which accepts unicode strings that carry no encoding declaration.

see http://codespeak.net/lxml/parsing.html#python-unicode-strings

Edit: If you already have the unicode string and can't control how it was made, you'll have to encode it again and provide the parser with the encoding you used:

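from lxml import etree

# Parser that always decodes its input as UTF-8, overriding any declared encoding.
utf8_parser = etree.XMLParser(encoding='utf-8')

def parse_from_unicode(unicode_str):
    # Re-encode to bytes so the parser's explicit encoding applies.
    s = unicode_str.encode('utf-8')
    return etree.fromstring(s, parser=utf8_parser)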

This makes sure that whatever was inside the XML declaration gets ignored, because the parser will always use UTF-8.

Steven
The whole problem is that I can't get a tree in the first place, if I could I wouldn't have any problems...
Stavros Korokithakis
@Stavros Korokithakis, etree is a module, not the parsed tree.
Daniel Kluev
@Daniel Kluev: Yes, but "tree" is a tree.
Stavros Korokithakis
@Steven: Regarding your edit, that should work, thanks. In the end, I took the encoding detection regex from lxml and used it to strip the encoding from the file. Since it fails early, I think it should be the fastest.
Stavros Korokithakis
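
For anyone wanting to try the regex approach from the last comment, here is a minimal sketch using only the standard re module. The pattern and the helper name strip_encoding_declaration are my own assumptions for illustration, not the regex taken from lxml. It removes just the encoding pseudo-attribute from a leading XML declaration, so the resulting unicode string should no longer trigger the encoding-declaration error when handed to a parser.

```python
import re

# Hypothetical pattern (an assumption, not lxml's own regex): matches the
# encoding pseudo-attribute inside an XML declaration at the very start
# of the text. Group 1 keeps everything that precedes it.
_XML_DECL_ENCODING = re.compile(
    r'^(<\?xml(?=\s)[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2'
)


def strip_encoding_declaration(text):
    """Return text with the encoding removed from a leading XML declaration.

    Text without a leading declaration, or without an encoding in it,
    is returned unchanged.
    """
    return _XML_DECL_ENCODING.sub(r'\1', text, count=1)


if __name__ == '__main__':
    doc = '<?xml version="1.0" encoding="iso-8859-1"?><root/>'
    print(strip_encoding_declaration(doc))
    # prints: <?xml version="1.0"?><root/>
```

Because the pattern is anchored at the start of the string, an attribute named encoding on an ordinary element is left alone. A byte order mark or whitespace before the declaration would prevent a match, so strip those first if your input may contain them.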