Recursive N-way merge/diff algorithm for directory trees?

Goals:

Find pairs of directories that have many similar files between them.

Generate short list of directory pairs that can be synchronized with 2-way merge to eliminate duplicates

Should operate recursively (there may be nested duplicates of higher-level directories)

Run time and storage should be O(n log n) in numbers of directories and files

Should be able to use an embedded DB or page to disk for processing more files than fit in memory (100,000+).

Optional: generate an ancestry and change-set between folders

Optional: sort the merge operations by how many duplicates they can elliminate

I know how to use hashes to find duplicate files in roughly O(n) space, but I'm at a loss for how to go from this to finding partially overlapping sets between folders and their children.

EDIT: some clarification The tricky part is the difference between "exact same" contents (otherwise hashing file hashes would work) and "similar" (which will not). Basically, I want to feed this algorithm at a set of directories and have it return a set of 2-way merge operations I can perform in order to reduce duplicates as much as possible with as few conflicts possible. It's effectively constructing an ancestry tree showing which folders are derived from each other.

The end goal is to let me incorporate a bunch of different folders into one common tree. For example, I may have a folder holding programming projects, and then copy some of its contents to another computer to work on it. Then I might back up and intermediate version to flash drive. Except I may have 8 or 10 different versions, with slightly different organizational structures or folder names. I need to be able to merge them one step at a time, so I can chose how to incorporate changes at each step of the way.

This is actually more or less what I intend to do with my utility (bring together a bunch of scattered backups from different points in time). I figure if I can do it right I may as well release it as a small open source util. I think the same tricks might be useful for comparing XML trees though.

It seems desirable just to work on the filenames and sizes (and timestamps if you find that they are reliable), to avoid reading in all those files and hashing or diffing them.

Here's what comes to mind.

Load all the data from the filesystem. It'll be big, but it'll fit in memory.
Make a list of candidate directory-pairs with similarity scores. For each directory-name that appears in both trees, score 1 point for all pairs of directories that share that name. For each filename that appears in both trees (but not so often that it's meaningless), score 1 point for all pairs of directories that contain a file with that name. Score bonus points if the two files are identical. Score bonus points if the filename doesn't appear anywhere else. Each time you give points, also give some points to all ancestor-pairs, so that if a/x/y/foo.txt is similar to b/z/y/foo.txt, then the pairs (a/x/y, b/z/y) and (a/x, b/z) and (a, b) all get points.
Optionally, discard all pairs with scores too low to bother with, and critically examine the other pairs. Up to now we've only considered ways that directories are similar. Look again, and penalize directory-pairs that show signs of not having common ancestry. (A general way to do this would be to calculate the maximum score the two directories could possibly have, if they both had all the files and they were all identical; and reject the pair if only a small fraction of that possible score was actually achieved. But it might be better to do something cheap and heuristic, or to skip this step entirely.)
Choose the best-scoring candidate directory-pair. Output it. Eliminate those directories and all their subdirectories from contention. Repeat.

Choosing the right data structures is left as an exercise.

This algorithm makes no attempt to find similar files with different filenames. You can do that across large sets of files using something like the rsync algorithm, but I'm not sure you need it.

This algorithm makes no serious attempt to determine whether two files are actually similar. It just scores 1 point for the same filename and bonus points for the same size and timestamp. You certainly could diff them to assign a more precise score. I doubt it's worth it.

ansaurus

tags:

views:

answers:

Recursive N-way merge/diff algorithm for directory trees?

Goals:

related questions