OK, the docs for Python's libxml2 bindings are really ****. My problem:
An XML document is stored in a string variable in Python. The string is an instance of unicode and contains non-ASCII characters. I want to parse it with libxml2, with code along these lines:
# -*- coding: utf-8 -*-
import libxml2
DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>Bäääh!</something>
</data>
"""
xml_doc = libxml2.parseDoc(DOC)
with this result:
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    xml_doc = libxml2.parseDoc(DOC)
  File "c:\Python26\lib\site-packages\libxml2.py", line 1237, in parseDoc
    ret = libxml2mod.xmlParseDoc(cur)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 46-48: ordinal not in range(128)
The point is the u"..." declaration. If I replace it with a plain "...", everything works. Unfortunately that is not an option in my setup, because DOC will always be a unicode instance.
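To narrow it down, here is a minimal sketch that reproduces the same kind of error without libxml2. It rests on an assumption taken only from the traceback: that the binding implicitly converts the unicode object with the ASCII codec before handing it to the C library. The helper try_ascii is hypothetical and only mimics that conversion; it is not part of libxml2.

```python
# -*- coding: utf-8 -*-
# Sketch only: mimics the implicit ASCII conversion that the traceback
# suggests happens inside the binding (an assumption, not confirmed).

DOC = u"""<?xml version="1.0" encoding="UTF-8"?>
<data>
<something>Bäääh!</something>
</data>
"""


def try_ascii(text):
    """Hypothetical helper: return the UnicodeEncodeError raised when
    encoding `text` as ASCII, or None if it encodes cleanly."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


# The non-ASCII characters make the ASCII conversion fail.
error = try_ascii(DOC)
print(error)

# An explicit UTF-8 encode succeeds and yields a byte string,
# i.e. the same kind of object as a plain "..." literal in Python 2.
utf8_bytes = DOC.encode("utf-8")
print(type(utf8_bytes))
```

If that guess is right, encoding to UTF-8 first would hand parseDoc the same kind of byte string as the plain "..." literal, which also matches the encoding="UTF-8" declaration in the document.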
Does anyone know how to get libxml2 to parse a unicode string like this one, which contains non-ASCII characters?