views:

382

answers:

3

I'm puzzled by minidom parser handling of empty element, as shown in following code section.

import xml.dom.minidom

doc = xml.dom.minidom.parseString('<value></value>')
print doc.firstChild.nodeValue.__repr__()
# Out: None
print doc.firstChild.toxml()
# Out: <value/>

doc = xml.dom.minidom.Document()
v = doc.appendChild(doc.createElement('value'))
v.appendChild(doc.createTextNode(''))
print v.firstChild.nodeValue.__repr__()
# Out: ''
print doc.firstChild.toxml()
# Out: <value></value>

How can I get consistent behavior? I'd like to receive empty string as value of empty element (which IS what I put in XML structure in the first place).

+1  A: 
value = thing.firstChild.nodeValue or ''
John Machin
Unfortunately, this does not solve my problem. In my code, I call method replaceWholeText upon TextElement in XML document. If I previously stored empty string in that TextElement, it would disappear next time XML file is parsed, and I would be unable to call method replaceWholeText. I could rebuild that element if it's not there, but that would be a very ugly hack.
Josip
What do you mean "rebuild the element"? It exists, its value just happens to be None instead of ''.
John Machin
+1  A: 

Xml spec does not distinguish these two cases.

Lloyd
I agree. That's why it seems strange that minidom module makes distinction. I can manually create empty text node and retrieve it as such, but once XML document is saved to disk and parsed again, empty text node becomes None object.
Josip
+1  A: 

Cracking open xml.dom.minidom and searching for "/>", we find this:

# Method of the Element(Node) class.
def writexml(self, writer, indent="", addindent="", newl=""):
    # [snip]
    if self.childNodes:
        writer.write(">%s"%(newl))
        for node in self.childNodes:
            node.writexml(writer,indent+addindent,addindent,newl)
        writer.write("%s</%s>%s" % (indent,self.tagName,newl))
    else:
        writer.write("/>%s"%(newl))

We can deduce from this that the short-end-tag form only occurs when childNodes is an empty list. Indeed, this seems to be true:

>>> doc = Document()
>>> v = doc.appendChild(doc.createElement('v'))
>>> v.toxml()
'<v/>'
>>> v.childNodes
[]
>>> v.appendChild(doc.createTextNode(''))
<DOM Text node "''">
>>> v.childNodes
[<DOM Text node "''">]
>>> v.toxml()
'<v></v>'

As pointed out by Lloyd, the XML spec makes no distinction between the two. If your code does make the distinction, that means you need to rethink how you want to serialize your data.

xml.dom.minidom simply displays something differently because it's easier to code. You can, however, get consistent output. Simply inherit the Element class and override the toxml method such that it will print out the short-end-tag form when there are no child nodes with non-empty text content. Then monkeypatch the module to use your new Element class.

Hao Lian
That's my point exactly. XML spec defines two forms as equivalent, but minidom treats <v></v> as '' if created at runtime, and yet parses <v></v> to Element "v" without TextElemet child node.
Josip
I followed your advice to change my approach on data serialization. I'll try JSON, as it better fits my needs. Thanks for help.
Josip