ansaurus

Question

Is there a way to force lxml to parse Unicode strings that specify an encoding in a tag?

Answer 1

A:

Basically, the solution is to do:

if isinstance(mystring, unicode):
    mystring = mystring.encode("utf-8")

Seriously. Good job, lxml.

EDIT: It turns out that, in this instance, lxml autodetects the encoding incorrectly. It appears that I will have to manually search for and remove "charset" and "encoding" from the page.
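For illustration, stripping the declared encoding could look something like the sketch below. This is not the regex lxml uses internally; the pattern and the name strip_xml_encoding are my own assumptions, and it only touches a leading XML declaration (an HTML meta charset is left alone). It is written for Python 3, where str plays the role of Python 2's unicode.

```python
import re

# Hypothetical pattern: the encoding pseudo-attribute inside a leading
# XML declaration, e.g. <?xml version="1.0" encoding="iso-8859-1"?>
_XML_ENCODING = re.compile(
    r'^(\s*<\?xml\b[^>]*?)\s+encoding\s*=\s*(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)

def strip_xml_encoding(text):
    """Return text with the encoding removed from a leading XML declaration.

    Text without such a declaration is returned unchanged.
    """
    return _XML_ENCODING.sub(r'\1', text, count=1)
```

Because the pattern is anchored at the start of the string, input without a declaration fails to match almost immediately, so the extra pass should be cheap.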

Stavros Korokithakis 2010-08-04 04:20:19

Answer 2

+2 A:

You cannot parse from a unicode string AND have an encoding declaration in that string. So either you make it an encoded string (as you apparently can't store it as an encoded string, you will have to re-encode it before parsing), or you serialize the tree as unicode with lxml yourself: etree.tostring(tree, encoding=unicode), WITHOUT an XML declaration. You can easily parse the result again with etree.fromstring.

see http://codespeak.net/lxml/parsing.html#python-unicode-strings

Edit: If, apparently, you already have the unicode string and can't control how it was made, you'll have to encode it again and provide the parser with the encoding you used:

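from lxml import etree

# Parser that always decodes its input as UTF-8, overriding any declared encoding.
utf8_parser = etree.XMLParser(encoding='utf-8')

def parse_from_unicode(unicode_str):
    # Re-encode to bytes so lxml accepts input that carries an XML declaration.
    s = unicode_str.encode('utf-8')
    return etree.fromstring(s, parser=utf8_parser)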

This makes sure that whatever is inside the XML declaration gets ignored, because the parser will always use UTF-8.
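To see the difference, here is a small self-contained sketch, assuming Python 3 (where str corresponds to Python 2's unicode) and an installed lxml. The sample document and the name parse_ignoring_declaration are made up for illustration.

```python
from lxml import etree

# A text string whose XML declaration claims an encoding it is not actually in.
doc = '<?xml version="1.0" encoding="iso-8859-1"?><a>\u00e9</a>'

def parse_ignoring_declaration(text):
    # Encode to UTF-8 bytes and tell the parser to use UTF-8 regardless
    # of what the declaration says.
    parser = etree.XMLParser(encoding='utf-8')
    return etree.fromstring(text.encode('utf-8'), parser=parser)

# Parsing the text string directly is rejected by lxml with a ValueError.
try:
    etree.fromstring(doc)
    direct_parse_failed = False
except ValueError:
    direct_parse_failed = True

root = parse_ignoring_declaration(doc)
```

Here root.text should come back as the original character, since the bytes are decoded as UTF-8 and not as the declared iso-8859-1.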

Steven 2010-08-04 08:51:00

The whole problem is that I can't get a tree in the first place, if I could I wouldn't have any problems...

Stavros Korokithakis 2010-08-05 13:49:50

@Stavros Korokithakis, etree is a module, not the parsed tree.

Daniel Kluev 2010-08-05 17:08:51

@Daniel Kluev: Yes, but "tree" is a tree.

Stavros Korokithakis 2010-08-06 04:15:56

@Steven: Regarding your edit, that should work, thanks. In the end, I took the encoding detection regex from lxml and used it to strip the encoding from the file. Since it fails early, I think it should be the fastest.

Stavros Korokithakis 2010-08-06 04:17:47
