I'd like to find data deduplication algorithms, mostly to find duplicate files. It looks like the first step is to identify the files with the same timestamps, sizes, and file names. I can then compute an MD5 checksum on those files and compare. In addition to that, it is possible to compare the contents of the files byte by byte. What else should I watch for?
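A minimal sketch of that staged approach (Python standard library only; the function names are illustrative, and a final byte-by-byte comparison is left as an optional extra step): bucket files by size first, since that check is cheap, and only hash the candidates that share a size.

    # Sketch of a staged duplicate-file scan: group by size first (cheap),
    # then by MD5 of the contents; survivors in the same bucket are candidates.
    import hashlib
    import os
    from collections import defaultdict

    def md5_of(path, chunk_size=1 << 20):
        """Hash a file in chunks so large files are not read into memory at once."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(root):
        # Stage 1: bucket by size -- files of different sizes cannot be duplicates.
        by_size = defaultdict(list)
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                by_size[os.path.getsize(path)].append(path)

        # Stage 2: within each size bucket, bucket by checksum of the contents.
        duplicates = []
        for paths in by_size.values():
            if len(paths) < 2:
                continue
            by_hash = defaultdict(list)
            for path in paths:
                by_hash[md5_of(path)].append(path)
            duplicates.extend(group for group in by_hash.values() if len(group) > 1)
        return duplicates

    if __name__ == "__main__":
        for group in find_duplicates("."):
            print(group)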

+1  A: 

There are products available for this. Look for Duplicate File Detective. It can match by name, timestamp, MD5, and other algorithms.

jvanderh
+1  A: 

You have OS meta-information (size and timestamps). Other meta-information includes permissions. You could compare inode and dnode information, but that doesn't mean much.

You have a summary (checksum).

You have byte-by-byte details.

What else could there be? Are you asking for other summaries? A summary is less informative than the byte-by-byte details. But you could easily invent lots of other summaries. A summary is only useful if you save it somewhere so you don't recompute it all the time.

If you want to save summaries for the "master" copy, you can invent any kind of summary you want. Line counts, letter "e" counts, average line length: anything is a potentially interesting summary.
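As a sketch of saving a summary so you don't recompute it all the time (Python standard library; the cache file name and key layout here are only assumptions), keyed on path, size, and modification time so a stale entry is recomputed when the file changes:

    # Cache a summary (here, the MD5 checksum) keyed by path, size, and mtime.
    import hashlib
    import json
    import os

    CACHE_FILE = "checksum_cache.json"  # hypothetical location for the saved summaries

    def load_cache():
        if os.path.exists(CACHE_FILE):
            with open(CACHE_FILE) as f:
                return json.load(f)
        return {}

    def save_cache(cache):
        with open(CACHE_FILE, "w") as f:
            json.dump(cache, f)

    def cached_md5(path, cache):
        stat = os.stat(path)
        key = "%s|%d|%f" % (path, stat.st_size, stat.st_mtime)
        if key not in cache:
            h = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            cache[key] = h.hexdigest()
        return cache[key]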

S.Lott