Directory contents diff


I'm looking for existing ideas / solutions to the problem of finding differences between two directories. Specifically how to identify files that might have been changed, renamed and moved.

A short list of things I've considered:

- Try to pair up files missing in dir A with new files in dir B by using some heuristic, such as a 75% match in content. This just doesn't seem robust enough (problem cases include significant changes in content, compression or encryption, and possible multiple matches).
- Use alternate data streams to add an id to each file. This would work only on NTFS.
- Add a header/footer to each file containing an id. There's no way to guarantee the header/footer will not corrupt the file.
- Ask for user input on each change to determine whether a file was indeed deleted or simply moved. This is too hard on the user.
- Require the user to rename/move files only by using special commands which keep track of such changes. This is too hard on the user.
- Set up a file system watcher to catch changes on the fly. This has several issues (the watcher must run at all times, it is platform specific...).

Any ideas welcome...

Why don't you simply calculate an MD5, SHA-1 or other hash of the folder contents?

http://en.wikipedia.org/wiki/MD5

Build a list of files/folders for A and B. Find which are present in A but not in B. Find which are present in B but not in A. For those which are present in both A and B, calculate a hash of each copy and compare the two.
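
The steps above could be sketched roughly as follows in Python. This is only an illustration, not a finished tool: the function names (`file_hash`, `list_files`, `compare_dirs`) are made up for this example, SHA-1 is an arbitrary choice of hash, and symlinks, permissions and empty directories are ignored.

```python
import hashlib
import os


def file_hash(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def compare_dirs(dir_a, dir_b):
    """Return (only_in_a, only_in_b, changed) as sets of relative paths."""
    files_a = list_files(dir_a)
    files_b = list_files(dir_b)
    only_a = files_a - files_b
    only_b = files_b - files_a
    # Hash only the files that exist on both sides.
    changed = {
        rel
        for rel in files_a & files_b
        if file_hash(os.path.join(dir_a, rel)) != file_hash(os.path.join(dir_b, rel))
    }
    return only_a, only_b, changed
```

Note that a moved or renamed file shows up here as one entry in `only_a` plus one entry in `only_b`, so this on its own says nothing about renames.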

Gad D Lord 2009-12-01 09:35:09

Perhaps I wasn't too clear - I'd like to keep track of those changes, not simply find out whether two directories are different.

Goran 2009-12-01 09:43:35

+1 A:

A possible, though not perfect, solution would be a version control system such as svn or git. That way, all change history is available. But users have to use specific commands.

mouviciel 2009-12-01 09:35:57

for content matching I recommend using some sort of distributed version control system such as git

it can pretty much detect all file operations such as copies, moves, renames, …

knittl 2009-12-01 09:37:46

How does git detect that robustly?

Goran 2009-12-01 09:42:48

it tracks contents and not files, so it can precisely detect changes in content, even across files. And if a file gets removed and another file with the same content gets added, then that's pretty much a rename
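
The "removed here, added there with the same content" idea can be sketched in a few lines of Python. This is not git's actual implementation, only an illustration of the principle; `content_id` and `detect_renames` are hypothetical names, and each snapshot is assumed to be a dict mapping a relative path to a hash of that file's content.

```python
import hashlib
from collections import defaultdict


def content_id(data):
    """Hash raw file content (bytes); equal content gives an equal id."""
    return hashlib.sha1(data).hexdigest()


def detect_renames(old, new):
    """Compare two snapshots given as dicts of {relative path: content hash}.

    Returns (renames, deleted, added), where renames maps old path -> new path.
    """
    removed = {p: h for p, h in old.items() if p not in new}
    appeared = {p: h for p, h in new.items() if p not in old}

    # Index the newly appeared paths by their content hash.
    appeared_by_hash = defaultdict(list)
    for path in sorted(appeared):
        appeared_by_hash[appeared[path]].append(path)

    # A removed path whose content reappears elsewhere is treated as a rename.
    # With several candidates, the first one in sorted order is taken.
    renames = {}
    for path in sorted(removed):
        candidates = appeared_by_hash.get(removed[path])
        if candidates:
            renames[path] = candidates.pop(0)

    deleted = sorted(set(removed) - set(renames))
    added = sorted(set(appeared) - set(renames.values()))
    return renames, deleted, added
```

Exact hash matching only covers files that were moved or renamed without being edited, and identical files (multiple matches) are paired arbitrarily.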

knittl 2009-12-01 10:45:44

Do you have any insight into how it accomplishes that? It can't possibly compare such large texts directly...

Goran 2009-12-01 12:40:22

what _large_ texts? what size are we talking about here? but basically it only checks for modifications if the hashes of objects/files changed

knittl 2009-12-01 13:06:52

So if I rename, move and change a file - will it be able to detect that this "new" file is a derivative of the old one?

Goran 2009-12-01 13:21:40

yes it will. if you want to inspect history with `git log` be sure to add `-C -M --follow` options, so you actually follow the history even over file rename boundaries.
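
For a file that was renamed and also edited, an exact hash match no longer works, so some similarity measure with a threshold is needed. The Python sketch below illustrates that idea using the standard library's `difflib`; it is not git's algorithm, the names `similarity` and `pair_by_similarity` and the 0.5 threshold are assumptions made for this example, and it only makes sense for text files.

```python
import difflib


def similarity(old_text, new_text):
    """Return a line-based similarity ratio between 0.0 and 1.0."""
    matcher = difflib.SequenceMatcher(None, old_text.splitlines(), new_text.splitlines())
    return matcher.ratio()


def pair_by_similarity(removed, added, threshold=0.5):
    """Pair removed files with added files whose content is similar enough.

    removed/added: dicts of {relative path: text content}.
    Returns a dict of {old path: (new path, score)}.
    """
    scores = []
    for old_path, old_text in removed.items():
        for new_path, new_text in added.items():
            score = similarity(old_text, new_text)
            if score >= threshold:
                scores.append((score, old_path, new_path))

    # Best matches first; each path is used in at most one pair.
    scores.sort(key=lambda item: (-item[0], item[1], item[2]))
    pairs = {}
    used_new = set()
    for score, old_path, new_path in scores:
        if old_path in pairs or new_path in used_new:
            continue
        pairs[old_path] = (new_path, score)
        used_new.add(new_path)
    return pairs
```

Comparing every removed file against every added file is quadratic, so on large trees it would be worth filtering out exact hash matches first and only running the similarity pass on what is left.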

knittl 2009-12-01 13:29:38

Got me going in the right direction, thanks :)

Goran 2009-12-11 10:41:45
