I'm looking for existing ideas / solutions to the problem of finding differences between two directories. Specifically, how to identify files that might have been changed, renamed, or moved.

A short list of things I've considered:

  • try to pair up files missing in dir A with new files in dir B by using some heuristic such as a 75% match in content. This just doesn't seem robust enough (problem cases include: significant changes in content, compression or encryption, possible multiple matches)
  • use alternate data streams to add an id to each file. This would work only on NTFS.
  • add a header/footer to each file containing an id. There's no way to guarantee the header/footer will not corrupt the file.
  • ask for user input for each change to determine if a file is indeed deleted or simply moved. This is too hard on the user.
  • require the user to rename/move files only by using special commands which will keep track of such changes. This is too hard on the user.
  • set up a file system watcher to catch changes on the fly. Several issues (watcher must run at all times, is platform specific...)
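
To make the first idea concrete, here is a minimal sketch of the similarity heuristic, assuming the files are text and small enough to hold in memory. The function name `pair_by_similarity` and the dict-of-contents inputs are made up for illustration; the 0.75 threshold is the 75% figure mentioned above.

```python
import difflib

def pair_by_similarity(missing, added, threshold=0.75):
    """Pair each missing file with its most similar added file.

    `missing` and `added` map a relative path to the file's text content
    (hypothetical inputs for this sketch). Returns
    {old_path: (new_path, ratio)} for pairs at or above `threshold`.
    """
    pairs = {}
    for old_path, old_text in missing.items():
        best_path, best_ratio = None, 0.0
        for new_path, new_text in added.items():
            # ratio() is a similarity score between 0.0 and 1.0
            ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
            if ratio > best_ratio:
                best_path, best_ratio = new_path, ratio
        if best_path is not None and best_ratio >= threshold:
            pairs[old_path] = (best_path, best_ratio)
    return pairs
```

Note that nothing here stops two missing files from being paired with the same new file, which is exactly the "multiple matches" problem, and compressed or encrypted content would score near zero even for a trivial edit.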

Any ideas welcome...

A: 

Why don't you simply calculate an MD5, SHA-1, or other hash of the folder content?

http://en.wikipedia.org/wiki/MD5

Build a list of files/folders for A and B. Find which are present in A but not in B, and which are present in B but not in A. For those present in both A and B, compute a hash of each copy and compare the two.
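
The steps above could be sketched roughly like this, assuming SHA-1 via Python's `hashlib` and regular files only (symlinks and empty directories are ignored). The names `file_digest`, `snapshot` and `compare_dirs` are just illustrative.

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root):
    """Map each file's path (relative to root, '/'-separated) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }

def compare_dirs(dir_a, dir_b):
    """Return (only_in_a, only_in_b, changed) as sorted lists of paths."""
    a, b = snapshot(dir_a), snapshot(dir_b)
    only_in_a = sorted(a.keys() - b.keys())
    only_in_b = sorted(b.keys() - a.keys())
    # Present in both, but the content hash differs
    changed = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
    return only_in_a, only_in_b, changed
```

A renamed file shows up here as one entry in `only_in_a` plus one in `only_in_b`, so this only reports differences; it doesn't link the two.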

Gad D Lord
Perhaps I wasn't too clear - I'd like to keep track of those changes not simply find out if two directories are different.
Goran
+1  A: 

A possible, though not perfect, solution would be a version control system such as svn or git. That way, the full change history is available. But users have to use specific commands.

mouviciel
A: 

for content matching i recommend using some sort of distributed version control system such as git

it can detect pretty much all file operations such as copies, moves, renames, …

knittl
How does git detect that robustly?
Goran
it tracks contents and not files, so it can precisely detect changes in content, even across files. and if a file gets removed and another file with the same content gets added, then that's pretty much a rename
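
a rough sketch of that "removed here, added there, same content" idea (this is not git's actual code: git hashes blob objects with a small header, whereas this just takes a plain SHA-1 of the bytes, and `find_exact_renames` plus its dict inputs are made up for illustration):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Identify content by a plain SHA-1 of its bytes (a simplification)."""
    return hashlib.sha1(data).hexdigest()

def find_exact_renames(before, after):
    """Pair removed paths with added paths that have identical content.

    `before` and `after` map path -> bytes content (hypothetical snapshots).
    Returns a list of (old_path, new_path) tuples.
    """
    removed = [p for p in sorted(before) if p not in after]
    added = [p for p in sorted(after) if p not in before]

    # Index the added files by their content hash
    by_hash = {}
    for path in added:
        by_hash.setdefault(content_id(after[path]), []).append(path)

    renames = []
    for old_path in removed:
        candidates = by_hash.get(content_id(before[old_path]), [])
        if candidates:
            # Each added file is used for at most one rename
            renames.append((old_path, candidates.pop(0)))
    return renames
```

this only catches files whose content is byte-for-byte unchanged; a file that was renamed and edited needs a similarity comparison on top.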
knittl
Do you have any insight into how it accomplishes that? It can't possibly compare such large texts directly...
Goran
what _large_ texts? what size are we talking about here? but basically it only checks for modifications if the hashes of objects/files changed
knittl
So if I rename, move and change file - will it be able to detect this "new" file is a derivative of the old one?
Goran
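yes it will, as long as the changed file is still similar enough to the old one. if you want to inspect history with `git log` be sure to add `-C -M --follow` options, so you actually follow the history even over file rename boundaries (`--follow` only works when you give it a single file).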
knittl
Got me going in the right direction, thanks :)
Goran