views:

92

answers:

2

I have two tree structures that represent snapshots of a directory structure at two different points in time. Directories may have been added, removed or modified between the snapshots. I need to walk the two trees simultaneously and mark the newer with the differences between the two - i.e. flag nodes as New, Modified, Deleted, Unchanged, adding any deleted nodes, so that the end result is the full superset of the two snapshots.

Typically, the trees are likely to be about 10 deep but very wide, containing hundreds of thousands, potentially millions of nodes. I want to skip large chunks of the trees by comparing hash codes at each node and only continuing to recurse where the codes don't match.

Is there an algorithm that could be my friend here? Any other advice?

+1  A: 

Imagine unrolling each tree into a sorted list of files and directories. A method could obtain the next input from each unrolled tree from an interator for that tree. I could then compare the hash codes and skip ahead on one tree or another, note deletions, and note modifications.

Steve Emmerson
Ok, I like this idea - have you got any links to example code employing this technique?
flesh
+1  A: 

The paper "Fast and Simple XML Tree Differencing by Sequence Alignment" by Lindholm, Kangasharju, and Tarkoma has some pointers:

1) rsync does the sort of thing you are interested in. Have a look at http://samba.anu.edu.au/ftp/rsync/rsync.html, and it might be worth checking to see if rsync --list-only does what it sounds like.

2) One trick is to turn the tree hierarchy into a sequence, by traversing it with a depth first search, and then compare the two sequences. Your idea about comparing hash codes can then be implemented by using a rolling hash (http://en.wikipedia.org/wiki/Rolling_hash).

I suspect that you will end up generating two entire sequences and then running some equivalent of diff or xdelta between them, rather than trying to do the job incrementally. A completely incremental approach might have problems when some sub-directory is moved a long way in the tree structure.

mcdowella
thanks for the answer - would you mind expanding a bit on diff or xdelta?thanks
flesh
The paper uses diff and xdelta in two ways: it uses them as a source for ideas about clever ways to find matches and it also uses the idea of matching up two sequences and then translating that back to the trees the sequences were generated from. E.g. if the stretches generated from subtrees A and B are matched as sequences then you match subtrees A and B. Thinking about this, I wonder about building a suffix array from one sequence and using that to move along the other sequence, at each point finding the point in the first sequence with the longest matching subsequence.
mcdowella