ansaurus

Question

Efficient algorithm for comparing XML nodes

Answer 1

A:

not a direct answer to your question, but closely related to what you are trying to acheive: have a look at XmlDiff (.net XML power tools)

PW 2008-12-05 12:42:57

Answer 2

+3 A:

What about this approach:

For all <w:pPr> nodes in the document (I suppose there is not more than one per <w:p>), concatenate all relevant data (element names, attributes, values) into a string:

// string format is really irrelevant, so this is just a bogus example
'!w:keep-with-next@value="true"!w:spacing@w:before="10"@w:after="120"'

Do so on alphabetical order, to account for varying document order.

Build a collection using these strings as the key and the reference to the respective <w:p> node as the value.

In the process of doing this, when you hit the point that a given key already exists in the collection, you found a paragraph with the same properties. Work with a list of nodes as the collection value, if you want to keep collecting.

I can't say how well this would perform, but I guess it is not too hard to implement and find out.

Tomalak 2008-12-05 13:12:48

I've made a one-off XSLT transformation that does the described string-building: http://pastebin.me/49393ec501fe9. Maybe its faster than doing it "manually" by DOM traversal. You will get a list of such elements: '<para position="2">!w:spacing@w:after"240"</para>'.

Tomalak 2008-12-05 14:50:50

Answer 3

+5 A:

I'd recommend against rolling your own hash creation function and instead rely on the in-built XNodeEqualityComparer's GetHashCode method. This guarantees to take account of attributes and descendant nodes when creating the result and could save you some time too.

Your code would look like the following:

XNodeEqualityComparer comparer = new XNodeEqualityComparer();
XDocument doc = XDocument.Load("XmlFile1.xml");
Dictionary<int, XNode> nodeDictionary = new Dictionary<int, XNode>();

foreach (XNode node in doc.Elements("doc").Elements("node"))
{
    int hash = comparer.GetHashCode(node);
    if (nodeDictionary.ContainsKey(hash))
    {
        // A duplicate has been found. Execute your logic here
        // ...
    }
    else
    {
        nodeDictionary.Add(hash, node);
    }
}

My XmlFile1.xml is:

<?xml version="1.0" encoding="utf-8" ?>
<doc>
  <node att="A">Blah</node>
  <node att="A">Blah</node>
  <node att="B">
    <inner>Innertext</inner>
  </node>
  <node>Blah</node>
  <node att="B">
    <inner>Different</inner>
  </node>
</doc>

nodeDictionary will end up containing a unique collection of Nodes and their hashes. Duplicates are detected by using the Dictionary's ContainsKey method, passing in the hash of the node, which we generate using the XNodeEqualityComparer's GetHashCode method.

I think this should be fast enough for your needs.

Dave R. 2008-12-05 14:00:47

XNodeEqualityComparer is a part of the 3.5 Framework, and his post would imply they are using 2.0. I would agree this is probably the best way to do it though, and they could include the relevent libraries maybe?

ICR 2008-12-05 15:20:19

Answer 4

+3 A:

It is very challenging even to define correctly the problem of

"When two xml documents are equal?"

There are many reasons for this:

An XML document is a tree that may have different textual representations.
Whitespace-only nodes may or may not be considered in a comparison
Comment nodes may or may not be considered in a comparison
PI nodes may or may not be considered in a comparison
Lexical differences: or
Different prefixes may be associated with the same namespace in the two documents
A namespace node may be shown as defined on a node of doc1 and as not defined but inherited from the parent of the corresponding node in doc2
Quotes may be used around an attribute in doc1 but apostrophes may be used in doc2
Entities may be used in doc1 but they may be pre-expanded in doc2
The two documents may have different but semantically equivalent DTDs
Etc.

Therefore it seems naive and unrealistic to try to produce a correct implementation of a function for equality comparison of two XML documents.

My recommendation is to use the deep-equal() function with a compliant XPath 2.0 engine.

Dimitre Novatchev 2008-12-05 14:37:45

There is a W3C recommendation to handle these problems: Canonical XML Version 1.0 (http://www.w3.org/TR/xml-c14n.html). The problems that you describe must be considered in many cases, e.g. when validating the digital signature of an XML document.

0xA3 2008-12-05 14:58:35

It doesn't handle all of these problems. For example, there is no namspace rewriting: http://www.w3.org/TR/xml-c14n.html#NoNSPrefixRewriting. Therefore, two XML documents which only have a different prefix associated with a specific namespace, will have different cannonicalizations.

Dimitre Novatchev 2008-12-05 18:45:18

Answer 5

+1 A:

Here is a hash function I have knocked up that attempts to solve a part of your problem. Note that I have very little experience writing hash functions, and have included it mainly to get feedback from people as to it's effectiveness in solving this particular problem. I would not recommend it's use in production.

static int HashXElement(XElement elem)
{
    int hash = 23;

    foreach (XAttribute attrib in elem.Attributes())
    {
        int attribHash = 23;
        attribHash = attribHash * 37 + attrib.Name.GetHashCode();
        attribHash = attribHash * 37 + attrib.Value.GetHashCode();
        hash = hash ^ attribHash;
    }

    foreach(XElement subElem in elem.Descendants())
    {
        hash = hash * 37 + XmlHash(subElem);
    }

    hash = hash * 37 + elem.Value.GetHashCode();

    return hash;
}

The ideas was to make the ordering of subnodes significant, but the ordering of attributes not significant.

ICR 2008-12-05 15:24:05

ansaurus

tags:

views:

answers:

Efficient algorithm for comparing XML nodes

related questions