views:

283

answers:

5

I'm looking for an efficient way to compare and get differences between two XML-based parse trees.

What would you suggest to be the best way to store those differences? I would have done this:

XML A:

<w:p>
  <w:pPr>
    <w:spacing w:after="1"/>
  </w:pPr>
  <w:r>
    <w:t>World</w:t>
  </w:r>
</w:p>

XML B:

<w:p>
  <w:pPr>
    <w:spacing w:after="1"/>
  </w:pPr>
  <w:r>
    <w:t>ASDF</w:t>
  </w:r>
</w:p>

The algorithm determines that "World" has changed to "ASDF" and then stores:

div: <w:p><w:r><w:t>World</w:t> -> <w:p><w:r><w:t>ASDF</w:t>

Is this enough to cover all cases that might occur?

Does anybody know of a good way to do that? Any help would really be appreciated!

+1  A: 

It might get harder. Look at this example:

<w:p>
  <w:pPr>
    <w:spacing w:after="1"/>
  </w:pPr>
  <w:r>
    <w:t>World</w:t> <-- Case 1: this changes to <w:t>ASDF</w:t>
    <w:t>World</w:t> <-- Case 2: this changes to <w:t>ASDF</w:t>
  </w:r>
</w:p>

To be able to recognize both cases, you'll have to store one as

 div: <w:p><w:r><w:t>World</w:t> -> <w:p><w:r><w:t>ASDF</w:t>

and the other as

 div: <w:p><w:r><w:t>World</w:t><w:t>World</w:t> -> <w:p><w:r><w:t>World</w:t><w:t>ASDF</w:t>

or something similar (you might also want to add "w:p" closing tags to both of them to make them valid XML sub-trees).

In general, such programs can get very complicated, so I wouldn't recommend you to create something completely new but to either use some existing diff algorithm (most will be good enough even without parsing the XML structure) or to modify one of them to suit your needs.

schnaader
A: 

Use the SO search feature to search for xml diff

anon
A: 

XMLDiff:

Explains how to use the XML Diff and Patch tool, which compares two XML files and produces an XML output of the differences, by utilizing a typical scenario that readers can apply to their own applications.

Mitch Wheat
A: 

How about a simple depth-first search over the common part? That is, do a depth-first search and as soon as you encounter a difference store it and start backtracking. The additional information needed to construct the context part of the output can be easily stored on the "backtrack stack".

mweerden
A: 

Hi,

when you want to compare the difference between two trees and somehow produce the "difference" out of that comparison, you are basically looking at a variant of tree edit distance problem. For a starter, check out this paper.

A more common "edit distance" problem is the problem of edit distance for strings. Version control software like CVS or SVN which use "delta coding" for storing the changes made to files use variants of the string edit distance algorithms to calculate the deltas. The case of trees is less common but definitely interesting.

antti.huima