Hi,
I want to determine whether to different child nodes within an XML document are equal or not. Two nodes should be considered equal if they have the same set of attributes and child notes and all child notes are equal, too (i.e. the whole sub tree should be equal).
The input document might be very large (up to 60MB, more than a 100000 nodes to compare) and performance is an issue.
What would be an efficient way to check for the equality of two nodes?
Example:
<w:p>
<w:pPr>
<w:spacing w:after="120"/>
</w:pPr>
<w:r>
<w:t>Hello</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:spacing w:after="240"/>
</w:pPr>
<w:r>
<w:t>World</w:t>
</w:r>
</w:p>
This XML snippet describes paragraphs in an OpenXML document. The algorithm would be used to determine whether a document contains a paragraph (w:p node) with the same properties (w:pPr node) as another paragraph earlier in the document.
One idea I have would be to store the nodes' outer XML in a hash set (Normally I would have to get a canonical string representation first where attributes and child notes are sorted always in the same way, but I can expect my nodes already to be in such a form).
Another idea would be to create an XmlNode object for each node and write a comparer which compares all attributes and child nodes.
My environment is C# (.Net 2.0); any feedback and further ideas are very welcome. Maybe somebody even has already a good solution?
EDIT: Microsoft's XmlDiff API can actually do that but I was wondering whether there would be a more lightweight approach. XmlDiff seems to always produce a diffgram and to always produce a canonical node representation first, both things which I don't need.
EDIT2: I finally implemented my own XmlNodeEqualityComparer based on the suggestion made here. Thanks a lot!!!!
Thanks, divo