What algorithms or Java libraries are available to do N-way, recursive diff/merge of directories?
I need to be able to generate a list of folder trees that have many identical files, and have subdirectories with many similar files. I want to be able to use 2-way merge operations to quickly remove as much redundancy as possible.
Goals:
- Find pairs of directories that have many similar files between them.
- Generate short list of directory pairs that can be synchronized with 2-way merge to eliminate duplicates
- Should operate recursively (there may be nested duplicates of higher-level directories)
- Run time and storage should be O(n log n) in numbers of directories and files
- Should be able to use an embedded DB or page to disk for processing more files than fit in memory (100,000+).
- Optional: generate an ancestry and change-set between folders
- Optional: sort the merge operations by how many duplicates they can elliminate
I know how to use hashes to find duplicate files in roughly O(n) space, but I'm at a loss for how to go from this to finding partially overlapping sets between folders and their children.
EDIT: some clarification The tricky part is the difference between "exact same" contents (otherwise hashing file hashes would work) and "similar" (which will not). Basically, I want to feed this algorithm at a set of directories and have it return a set of 2-way merge operations I can perform in order to reduce duplicates as much as possible with as few conflicts possible. It's effectively constructing an ancestry tree showing which folders are derived from each other.
The end goal is to let me incorporate a bunch of different folders into one common tree. For example, I may have a folder holding programming projects, and then copy some of its contents to another computer to work on it. Then I might back up and intermediate version to flash drive. Except I may have 8 or 10 different versions, with slightly different organizational structures or folder names. I need to be able to merge them one step at a time, so I can chose how to incorporate changes at each step of the way.
This is actually more or less what I intend to do with my utility (bring together a bunch of scattered backups from different points in time). I figure if I can do it right I may as well release it as a small open source util. I think the same tricks might be useful for comparing XML trees though.