Hi there,

I'm trying to create an XML document in Python, but some of the strings I'm working with are encoded in Unicode. Is there a way to create a text node with xml.dom.minidom from Unicode strings? Is there another module I can use?

Thanks.

A: 

The DOM objects seem to have an encoding argument; see section 20.7.1 of the Python docs. Read the footnote as well, and take care to use the proper encoding string.

Nick T
I'm not sure this is going to work for me. I need to actually create a DOM object (using createTextNode(string)), but this fails if the string is Unicode. 20.7.1 seems to allow me to create Unicode strings from nodes, but not nodes from Unicode strings. Is there another way? Or am I just misunderstanding the documentation?
Jordan
+1  A: 

In theory, per the docs:

the DOMString defined in the recommendation is mapped to a Python string or Unicode string. Applications should be able to handle Unicode whenever a string is returned from the DOM.

so you should be fine with either a Unicode string or a Python string (UTF-8 is the default encoding in XML).

In practice, in Python 2, I've sometimes had problems with Unicode strings in xml.dom (I've switched almost entirely away from it and to ElementTree a while ago, so I'm not positive that the problems are still there in recent Python 2 releases).
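Since the question also asks about other modules, here is a minimal ElementTree sketch. It assumes Python 3 (where `encoding="unicode"` is accepted by `tostring`), and the element name `a` is only illustrative:

```python
import xml.etree.ElementTree as ET

# Element name "a" is a placeholder chosen for this sketch.
root = ET.Element("a")

# Text is assigned as an ordinary (Unicode) string.
root.text = "pi\u00e9"

# encoding="unicode" asks for a str rather than bytes (Python 3 only).
as_text = ET.tostring(root, encoding="unicode")

# A real encoding name yields bytes in that encoding.
as_bytes = ET.tostring(root, encoding="utf-8")
```

The text stays Unicode inside the tree; an encoding is chosen only when serializing.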

If you do meet problems using Unicode strings directly, I think you'll want to try encoded strings instead, e.g., thedoc.createTextNode(u'pié'.encode('utf-8')).

In Python 3, of course, strs are Unicode, so everything's rather different in this regard;-).
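As a rough illustration of the Python 3 case, assuming a document with a single placeholder element named `a`: the text goes into createTextNode as a plain `str`, and bytes only appear if an encoding is requested at output time.

```python
from xml.dom import minidom

# Build a small document; the element name "a" is only illustrative.
doc = minidom.Document()
root = doc.createElement("a")
doc.appendChild(root)

# In Python 3, str is Unicode, so no .encode() call is needed here.
root.appendChild(doc.createTextNode("pi\u00e9"))

# With no encoding argument, toxml() returns a str.
as_text = doc.toxml()

# With an encoding argument, toxml() returns bytes in that encoding.
as_bytes = doc.toxml(encoding="utf-8")
```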

Alex Martelli
Exactly what I wanted. Thanks!
Jordan
?? This is exactly what you mustn't do. Text node data in the DOM is defined as Unicode. Passing a byte string in instead results in a faulty infoset, which will give you UnicodeErrors in later processing.
bobince
@bobince, in theory you're perfectly correct, in practice I've sometimes managed to work minidom with str where it was failing with unicode (the docs say it takes either!) -- expanded my answer to clarify.
Alex Martelli
A: 

Is there a way to create a text node using xml.dom.minidom using unicode strings?

Yes, createTextNode always takes Unicode strings. The text model of the XML information set is Unicode, as you can see:

>>> from xml.dom import minidom
>>> doc= minidom.parseString('<a>b</a>')
>>> doc.documentElement.firstChild.data
u'b'

So:

>>> doc.createTextNode(u'Hell\xF6') # OK
<DOM Text node "u'Hell\xf6'">

Minidom does allow you to put non-Unicode strings in the DOM, but if you do and they contain non-ASCII characters you'll come a cropper later on:

>>> doc.documentElement.appendChild(doc.createTextNode('Hell\xF6')) # Wrong, not Unicode string
<DOM Text node "'Hell\xf6'">

>>> doc.toxml()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 45, in toxml
    return self.toprettyxml("", "", encoding)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 60, in toprettyxml
    return writer.getvalue()
  File "/usr/lib/python2.6/StringIO.py", line 270, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 4: ordinal not in range(128)

This is assuming that by “encoded in unicode” you mean you are using Unicode strings. If you mean something else, like you've got byte strings in a UTF-8 encoding, you need to convert those byte strings to Unicode strings before you put them in the DOM:

>>> b= 'Hell\xc3\xb6'    # Hellö encoded in UTF-8 bytes
>>> u= b.decode('utf-8') # Proper Unicode string Hellö
>>> doc.documentElement.appendChild(doc.createTextNode(u))
>>> doc.toxml()
u'<?xml version="1.0" ?><a>bHell\xf6</a>' # correct!
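One way to make the decode step hard to forget is to funnel every value through a small helper before it reaches the DOM. The sketch below is only an illustration: `to_text` is a hypothetical name, not part of minidom, and it assumes the incoming byte strings are UTF-8 unless told otherwise.

```python
from xml.dom import minidom

def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as a Unicode string.

    Byte strings are decoded with the given encoding (UTF-8 is an
    assumption here); anything else is returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

doc = minidom.parseString("<a>b</a>")

raw = b"Hell\xc3\xb6"  # "Hellö" as UTF-8 bytes

# Decode first, so the text node only ever holds Unicode.
doc.documentElement.appendChild(doc.createTextNode(to_text(raw)))

result = doc.toxml()
```

If the bytes are in some other encoding, pass its name as the second argument rather than relying on the default.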
bobince