I need a validated DOM tree with a DTD (in order to use getElementById). Validating and parsing work, but the DOM doesn't work properly:

from xml.dom import minidom 
from xml.dom.pulldom import SAX2DOM
from lxml import etree
import lxml.sax
from StringIO import StringIO

data_string = """\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo [
<!ELEMENT foo (bar)*>
<!ELEMENT bar (#PCDATA)>
<!ATTLIST bar id ID #REQUIRED>]><foo><bar id="nr_0">text</bar></foo> 
"""

# parser that validates against the DTD while parsing
etree_parser = etree.XMLParser(dtd_validation=True, attribute_defaults=True)
# parse the document, then replay it as SAX events into a DOM builder
sax_tree = etree.parse(StringIO(data_string), etree_parser)
handler = SAX2DOM()
lxml.sax.saxify(sax_tree, handler)
domObject = handler.document

print domObject.getElementById("nr_0")
# prints None

print minidom.parseString(data_string).getElementById("nr_0")
# prints <DOM Element: bar at 0x7f36b77dc0e0>

It seems that SAX2DOM won't pass the DTD to the DOM. Did I forget something? I've read that it is impossible to load the DTD after the DOM is built.

Any ideas?

+1  A: 

As far as I know, SAX DTD events are not handled by the ContentHandler but by the DTDHandler, which is a separate property you set on the SAX parser (XMLReader). lxml.sax.saxify only feeds a ContentHandler, so the DTD never reaches SAX2DOM. This means that you cannot do this without serializing and reparsing the document:

# sax_tree is the validated lxml tree from the question
validated_string = etree.tostring(sax_tree)
domDocument = minidom.parseString(validated_string)
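
For completeness, here is a self-contained sketch of the serialize-and-reparse route. It assumes Python 3 with lxml installed and reuses the document from the question as bytes; the names DATA, validated_bytes and dom_document are just illustrative.

```python
from io import BytesIO
from xml.dom import minidom

from lxml import etree

# Same document as in the question, given as bytes because it carries
# an encoding declaration.
DATA = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo [
<!ELEMENT foo (bar)*>
<!ELEMENT bar (#PCDATA)>
<!ATTLIST bar id ID #REQUIRED>]><foo><bar id="nr_0">text</bar></foo>
"""

# Step 1: parse and validate with lxml.
parser = etree.XMLParser(dtd_validation=True, attribute_defaults=True)
tree = etree.parse(BytesIO(DATA), parser)

# Step 2: serialize the whole tree. Passing the tree (not just the root
# element) keeps the DOCTYPE with its internal subset in the output.
validated_bytes = etree.tostring(tree)

# Step 3: let minidom parse the result, so it reads the ATTLIST itself.
dom_document = minidom.parseString(validated_bytes)
node = dom_document.getElementById("nr_0")
print(node.tagName, node.getAttribute("id"))
```

The cost is a second parse of the whole document, which may matter for large inputs.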

On the other hand, unless you really need a minidom document, you'd be better off just staying with the lxml tree. (You can use XPath for the equivalent of getElementById, or have a look at etree.XMLDTDID and etree.parseid.)
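
A minimal sketch of the lxml-only route, assuming Python 3 and the same document as in the question (as bytes). XMLDTDID is called with the default, non-validating parser here; a validating parser could be passed through its parser argument. The names DATA, ids and matches are illustrative.

```python
from lxml import etree

DATA = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE foo [
<!ELEMENT foo (bar)*>
<!ELEMENT bar (#PCDATA)>
<!ATTLIST bar id ID #REQUIRED>]><foo><bar id="nr_0">text</bar></foo>
"""

# XMLDTDID returns the root element plus a dictionary-like object that
# maps the values of DTD-declared ID attributes to their elements.
root, ids = etree.XMLDTDID(DATA)
bar = ids["nr_0"]
print(bar.tag, bar.text)

# XPath alternative: match on the attribute name directly. This does not
# consult the DTD, so it also matches "id" attributes that are not
# declared as type ID.
matches = root.xpath("//*[@id = $wanted]", wanted="nr_0")
print(matches[0].tag)
```

etree.parseid works the same way for files and file-like objects, returning the tree and the ID dictionary. As far as I remember, the lxml documentation warns against modifying the tree while you rely on the ID dictionary.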

Steven
Hmmm, I think you're right. Reparsing isn't really an option. I googled a bit for etree, and it looks like it is better than minidom in every respect. Thx!
Carsten C.