Does anyone know how I would get a DOM instance (tree) of an XML file in Python? I am trying to compare two XML documents to each other that may have elements and attributes in a different order. How would I do this?
For comparing XML document instances, a naive comparison of the parsed DOM trees will not work. You will probably need to implement your own NodeComparator that recursively compares a node and its child nodes with some other node and its child nodes, based on your specific criteria, such as:
- When is the order of child elements significant?
- When is whitespace in text-content significant?
- Are there default values for some elements and are they applied by your parser?
- Should entity references be expanded for comparison?
Minidom (xml.dom.minidom) is a good starting point for parsing the files and is easy to use. The actual comparison function for your specific application, however, has to be written by you.
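To make the parsing step concrete, here is a minimal sketch of getting a DOM tree with minidom. The sample document and the variable names are made up for illustration; minidom.parse(path) should work the same way for a file on disk.

```python
from xml.dom import minidom

# Hypothetical sample document, standing in for one of the XML files.
SAMPLE = '<root b="2" a="1"><item>one</item><item>two</item></root>'

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

# Attributes come back as a NamedNodeMap; a plain dict is easier to compare
# and ignores the order in which the attributes were written.
attrs = dict(root.attributes.items())

# Keep only element children (text and comment nodes are skipped here,
# which is itself one of the comparison decisions listed above).
children = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
child_tags = [c.tagName for c in children]
```

From here a comparison function would recurse over `children`, applying whatever rules you settled on for order and whitespace.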
Personally, whenever possible, I'd start with elementtree (preferably the C implementation that comes with Python's standard library, or the lxml implementation, but that's essentially only a matter of higher speed). It's not a standard-compliant DOM, but it holds the same information in a more Pythonic and handier way. You can start by calling xml.etree.ElementTree.parse, which takes the XML source and returns an element tree; do that on both sources, use getroot on each element tree to obtain its root element, then recursively compare elements starting from the root ones.
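A rough sketch of that parse-then-recurse approach follows. The function name elements_equal and the sample documents are assumptions made for illustration, and stripping whitespace from text is just one possible policy. This version is strict about child order but, because attrib is a dict, already ignores attribute order.

```python
import io
import xml.etree.ElementTree as ET

def elements_equal(a, b):
    """Order-sensitive recursive comparison of two elements (illustrative)."""
    if a.tag != b.tag:
        return False
    if a.attrib != b.attrib:  # dict comparison, so attribute order is ignored
        return False
    # Assumption: leading/trailing whitespace in text is not significant.
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    if len(a) != len(b):  # number of direct children
        return False
    return all(elements_equal(x, y) for x, y in zip(a, b))

# ET.parse accepts a filename or a file object; StringIO stands in for files.
root1 = ET.parse(io.StringIO('<r a="1" b="2"><x>hi</x><y/></r>')).getroot()
root2 = ET.parse(io.StringIO('<r b="2" a="1"><x>hi</x><y/></r>')).getroot()
root3 = ET.parse(io.StringIO('<r a="1" b="2"><y/><x>hi</x></r>')).getroot()

same_attrs_reordered = elements_equal(root1, root2)     # attributes swapped
same_children_reordered = elements_equal(root1, root3)  # children swapped
```

With these inputs the first comparison succeeds and the second fails, since only child order is treated as significant.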
Children of an element form a sequence, in element tree just as in the standard DOM, meaning their order is considered important; but it's easy to make Python sets out of them (or with a little more effort "multi-sets" of some kind, if repetitions are important in your use case though order is not) for a laxer comparison. It's even easier for attributes for a given element, where uniqueness is assured and order is semantically not relevant.
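One way to sketch the laxer, order-insensitive comparison is to reduce each element to a hashable signature and collect the children's signatures in a multiset. The names canonical and equal_unordered are hypothetical, collections.Counter plays the role of the "multi-set", and tail text is ignored here as a simplifying assumption (so mixed content is not handled faithfully).

```python
from collections import Counter
import xml.etree.ElementTree as ET

def canonical(elem):
    """Hashable signature that ignores child order and attribute order."""
    # Count identical child signatures so that repetitions still matter.
    children = Counter(canonical(child) for child in elem)
    return (
        elem.tag,
        frozenset(elem.attrib.items()),
        (elem.text or "").strip(),      # assumption: whitespace is insignificant
        frozenset(children.items()),    # multiset as (signature, count) pairs
    )

def equal_unordered(a, b):
    return canonical(a) == canonical(b)

doc_a = ET.fromstring('<r a="1" b="2"><x>hi</x><y/><y/></r>')
doc_b = ET.fromstring('<r b="2" a="1"><y/><x>hi</x><y/></r>')  # reordered
doc_c = ET.fromstring('<r a="1" b="2"><x>hi</x><y/></r>')      # one <y/> fewer
```

Here equal_unordered(doc_a, doc_b) is true, while equal_unordered(doc_a, doc_c) is false because the multiset notices the missing repetition. Swapping the Counter for a plain set would make repetitions irrelevant as well.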
Is there some specific reason you need a standard DOM rather than an alternative container like an element tree, or are you just using the term DOM in a general sense so that element tree would be OK?
In the past I've also had good results using PyRXP, which uses an even starker and simpler representation than ElementTree. However, it WAS years and years ago; I have no recent experience as to how PyRXP today compares with lxml or cElementTree.