views: 1701
answers: 14

If you have 1,000,000 source files, you suspect they are all the same, and you want to compare them, what is the current fastest method to compare those files? Assume they are Java files and the platform where the comparison is done is not important. cksum is making me cry. When I say identical I mean ALL identical.

Update: I know about generating checksums. diff is laughable ... I want speed.

Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

Update: Read the number of blocks rather than bytes? Immediately throw out those that differ? Is that faster than finding the number of bytes?

Update: Is this ANY different than the fastest way to compare two files?

+1  A: 

Using cksum is not as reliable as using something like md5sum. But I would opt for maximum reliability, which means a byte-by-byte comparison using cmp.

You have to read every byte in both files for all checking methods so you may as well opt for the one that is most reliable.

As a first pass, you could check the directory listing to see if the sizes are different. That's a quick way to get faster feedback for different files.
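For illustration, a minimal Java sketch of that first pass, which only looks at file sizes before any contents are read (the class name and directory argument are placeholders, not part of any real tool):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;
    import java.util.stream.*;

    // Minimal sketch: group the candidate files by size, so that only files of
    // equal length ever need the expensive byte-by-byte comparison.
    public class SizePrefilter {

        static Map<Long, List<Path>> groupBySize(Path dir) throws IOException {
            try (Stream<Path> files = Files.walk(dir)) {
                return files.filter(Files::isRegularFile)
                            .collect(Collectors.groupingBy(p -> {
                                try { return Files.size(p); }
                                catch (IOException e) { return -1L; }
                            }));
            }
        }

        public static void main(String[] args) throws IOException {
            Map<Long, List<Path>> groups = groupBySize(Paths.get(args[0]));
            // More than one distinct size means the files cannot all be identical.
            System.out.println(groups.size() == 1
                    ? "All files have the same size; contents still need checking."
                    : "Files differ in size already.");
        }
    }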

paxdiablo
Do you really have to read every byte of every file? Can you stop reading if one file falls out of favor?
ojblass
Yes, you can stop when you find a difference; that's the *big* advantage of a cmp solution over any cksum/md5sum/any-checksum solution. For identical files, you have to read the lot (even for cmp).
paxdiablo
A: 

MD5 hash would be faster than comparison, but slower than a normal CRC-check. You have to figure out the kind of reliability you want in comparison.

Sandy
Why would MD5 be faster than comparison? They both have to read all bytes in both files.
paxdiablo
In fact, comparison should be faster, because it can abort once it hits a different byte. And check the file size first.
Thilo
This (aborting after the first different byte) was exactly the idea I had, but it turned out to be wrong, at least when working with large files! See my answer.
Thomas Padron-McCarthy
@Thomas: Your testing must have been flawed - see my comments on your answer.
Software Monkey
@Software Monkey: Maybe we misunderstand each other, and mean different things, but the tests I ran are clear: Reading two files twice is sometimes faster than reading them just once! I actually re-ran a simplified version of the test now, and (on one particular computer, with a particular pair of files) it was almost twice as fast to read the two files twice than to read them just once!
Thomas Padron-McCarthy
If you read the file once and note the time, then read the file twice and note the time, the last two reads are faster because the file is cached. It can't be faster to read a file twice (under the same conditions) than once.
BeWarned
@BeWarned: Check my updated answer, and the link.
Thomas Padron-McCarthy
+1  A: 

Well, the optimal algorithm will depend on the number of duplicate files.

Assuming a few are the same but most are different and the files are big.

Filter out the ones that are obviously not the same using a simple file length check.

Choose random bytes from the file, calculate a hash and compare (minimizing disk seeks)

Follow that with a full file SHA1.
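A minimal Java sketch of that staged filter (the sampled offsets, buffer size and class name are arbitrary placeholders; a final byte-by-byte check would still be needed for absolute certainty):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.file.*;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch of the staged filter: cheap size check, then a hash over a few
    // sampled offsets, and only then a full-file SHA-1.
    public class StagedCompare {

        // Hash a handful of fixed offsets so a pair of files can be rejected cheaply.
        static byte[] sampleDigest(Path file, long[] offsets)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buf = new byte[64];
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                for (long off : offsets) {
                    if (off < 0 || off >= raf.length()) continue;
                    raf.seek(off);
                    int n = raf.read(buf);
                    if (n > 0) md.update(buf, 0, n);
                }
            }
            return md.digest();
        }

        // Full-file SHA-1, used only for files that survive the cheaper checks.
        static byte[] fullDigest(Path file) throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(Files.readAllBytes(file));   // fine for source-sized files
            return md.digest();
        }

        static boolean probablySame(Path a, Path b) throws IOException, NoSuchAlgorithmException {
            if (Files.size(a) != Files.size(b)) return false;            // stage 1: length
            long[] offsets = {0, 4096, 65536, Files.size(a) - 64};       // stage 2: samples
            if (!MessageDigest.isEqual(sampleDigest(a, offsets), sampleDigest(b, offsets)))
                return false;
            return MessageDigest.isEqual(fullDigest(a), fullDigest(b));  // stage 3: full hash
        }
    }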

Sam Saffron
picking a few random bits... very bright...
ojblass
File sizes from the directory is a good idea, but random bytes is useless since you still have to compare the whole file to be reliable. And any sort of checksum is going to be slower than a buffered byte-by-byte compare.
paxdiablo
How about something like looking at the last byte of all one million of them?
ojblass
and going down the chain...
ojblass
Random bytes can be faster if you have a ton of files with the same size that are different; you can use the random bytes to avoid having to calculate a full hash.
Sam Saffron
Random bytes is an unreliable method - you can check all but one byte in the two files and still not be 100% sure that they're identical. You have to check every byte and, as long as you're having to do that, buffered sequential reads will outperform random access.
paxdiablo
Where you're coming from, I believe, is the same early-exit strategy if you find a difference in the file. But a cmp solution already has that.
paxdiablo
@pax that's true, I guess you can have a slight advantage with a random approach if a ton of files have the same header ... it's hard to engineer a solution if you do not know the exact problem. The cmp approach is KISS so it's good.
Sam Saffron
+1  A: 

I would run something like this

find . -name \*.java -print0 | xargs -0 md5sum | sort

then see which files have different MD5 sums. This will group the files by checksum.

You can replace md5sum with sha1sum or even rmd160 if you like.

Blair Zajac
and you still need to manually compare the 1,000,000 pairs of md5
Thilo
The first step should be to group the files by file size! After that, MD5 for all groups of at least two files of equal length may very well be the fastest method. (And don't compare the checksums manually - you have a computer for that!)
Thomas Padron-McCarthy
+7  A: 

I'd opt for something like the approach taken by the cmp program: open two files (say file 1 and file 2), read a block from each, and compare them byte-by-byte. If they match, read the next block from each, compare them byte-by-byte, etc. If you get to the end of both files without detecting any differences, seek to the beginning of file 1, close file 2 and open file 3 in its place, and repeat until you've checked all files. I don't think there's any way to avoid reading all bytes of all files if they are in fact all identical, but I think this approach is (or is close to) the fastest way to detect any difference that may exist.

OP Modification: Lifted up important comment from Mark Bessey

"another obvious optimization if the files are expected to be mostly identical, and if they're relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once."

David Zaslavsky
I think compare has a setting to give up once a difference is found.
ojblass
I wonder whether the file system could give you cheap checksums if one used something like ZFS.
Thilo
move it to a ZFS file system... can you expand on that?
ojblass
I cannot speak on ZFS, just thinking out loud. But ZFS has built-in integrity checksums for all files (they are also working on de-duplication). I wonder if those are exposed to userland, so that one could check efficiently if files are identical.
Thilo
You may want to compare sizes before reading the files.
Brian Rasmussen
another obvious optimization if the files are expected to be mostly identical, and if they're relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once.
Mark Bessey
A: 

Why reinvent the wheel? How about a third-party app? Granted it doesn't have APIs, but I don't imagine you put yourself in this situation often. I like the app DoubleKiller; just make a backup before you start. :) It's fast and free!

NitroxDM
A: 

First compare the file lengths of all million. If you have a cheap way to do so, start with the largest files. If they all pass that, then compare each file using a binary division pattern; this will fail faster on files that are similar but not the same. For information on this method of comparison, see the Knuth-Morris-Pratt method.

Peter Wone
+1  A: 

I don't think hashing is going to be faster than byte-by-byte comparisons. The byte-by-byte comparison can be optimized a bit by pipelining the reading and comparison of the bytes; also, multiple sections of the file could be compared in parallel threads. It would go something like this:

  • Check if the file sizes differ
  • Read blocks of the files into memory asynchronously
  • Hand them off to worker threads to do the comparisons

Or just run several cmp processes (or the equivalent for your OS) in parallel. This could be scripted easily and you still get the benefit of parallelism.
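In the same spirit, here is a minimal Java sketch of the parallel variant rather than a shell script around cmp; Files.mismatch needs Java 12+, and the reference file and fixed thread pool are assumptions:

    import java.nio.file.*;
    import java.util.*;
    import java.util.concurrent.*;

    // Sketch: compare every candidate against one reference file in parallel.
    // Files.mismatch (Java 12+) returns -1 when the contents are identical and
    // can stop at the first differing byte, preserving cmp's early exit.
    public class ParallelCompare {

        public static void main(String[] args) throws Exception {
            Path reference = Paths.get(args[0]);
            List<Path> candidates = new ArrayList<>();
            for (int i = 1; i < args.length; i++) candidates.add(Paths.get(args[i]));

            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<Path>> results = new ArrayList<>();
            for (Path c : candidates) {
                // Return the path if it differs from the reference, null if identical.
                results.add(pool.submit(() -> Files.mismatch(reference, c) == -1L ? null : c));
            }
            for (Future<Path> f : results) {
                Path differing = f.get();
                if (differing != null) System.out.println(differing + " differs");
            }
            pool.shutdown();
        }
    }

On a single spinning disk the parallel reads may of course fight each other for the read head, so the thread count is very much a tuning knob.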

BeWarned
Isn't the disk bus the bottleneck here? Maybe dispersing the files... You are on to something here...
ojblass
Remember that the physical disk is the bottleneck! When I tried byte by byte comparisons it turned out to be slower than checksums, because of the physical properties of disks. When you calculate a checksum for a file, you read blocks sequentially from the disk (assuming not too much fragmentation). When you compare files directly, and read them in parallel, you have to read blocks all over the disk.
Thomas Padron-McCarthy
"Remember that the physical disk is the bottleneck! When I tried byte by byte comparisons .."You would of course read reasonable large chunks of the file at once. Small files would in effect by slurped completely.Having the two sets of files on two different disk would be ideal.
Thilo
It really depends how your disks are arranged. You can spread your files across different devices so that the I/O for the files is on different spindles and individual files are likely to be contiguous. You can even copy your files to a ramdisk and do the comparisons there.
BeWarned
After re-reading the question, if you are continually comparing the output of a program then you only need to compare the output of the last two runs.
BeWarned
+1  A: 

There are a number of programs that compare a set of files in general to find identical ones. FDUPES is a good one: Link. A million files shouldn't be a problem, depending on the exact nature of the input. I think that FDUPES requires Linux, but there are other such programs for other platforms.

I tried to write a faster program myself, but except for special cases, FDUPES was faster.

Anyway, the general idea is to start by checking the sizes of the files. Files that have different sizes can't be equal, so you only need to look at groups of files with the same size. Then it gets more complicated if you want optimal performance: If the files are likely to be different, you should compare small parts of the files, in the hope of finding differences early, so you don't have to read the rest of them. If the files are likely to be identical, though, it can be faster to read through each file to calculate a checksum, because then you can read sequentially from the disk instead of jumping back and forth between two or more files. (This assumes normal disks, so SSDs may be different.)

In my benchmarks when trying to make a faster program it (somewhat to my surprise) turned out to be faster to first read through each file to calculate a checksum, and then if the checksums were equal, compare the files directly by reading blocks alternately from each file, than to just read blocks alternately without the previous checksum calculations! It turned out that when calculating the checksums, Linux cached both files in main memory, reading each file sequentially, and the second reads were then very fast. When starting with alternating reads, the files were not (physically) read sequentially.

EDIT:

Some people have expressed surprise and even doubt that it could be faster to read the files twice than reading them just once. Perhaps I didn't manage to explain very clearly what I was doing. I am talking about cache pre-loading, in order to have the files in the disk cache when later accessing them in a way that would be slow to do on the physical disk drive. Here is a web page where I have tried to explain in more detail, with pictures, C code and measurements.

However, this has (at best) marginal relevance to the original question.

Thomas Padron-McCarthy
???? That's a little crazy.
ojblass
Yes. I was a bit annoyed, actually.
Thomas Padron-McCarthy
Even if the files were cached, it *must* take longer to read all the bytes to hash them and then reprocess them to compare than it would to just read and compare them to start with. As well, the compare can abort on first mismatch. Therefore, your test was faulty or your measurement of the time taken was.
Software Monkey
Given two files of 1,000,000 bytes, doing 1,000,000 x "if(chr1!=chr2)" must be faster than 1,000,000 x "hash.update(ch1)" plus 1,000,000 x "hash.update(ch2)", even if the hashing is optimized by using an update() function which takes an array or pointer.
Software Monkey
@Thomas: What buffer size did you use for reading the files for the comparison?
Software Monkey
The read size was 4096 bytes. I did consider a version that increased the buffer size exponentially for later reads, but never implemented it. And now I think the idea from another answer, with reads in random places of the files, and then a full checksum, will be better.
Thomas Padron-McCarthy
The first two comments above by Software Monkey: Remember the caching. Reading two files sequentially into memory from the physical disk can be faster than reading them both in parallel, alternating between them (moving the read head back and forth). Everything you do later, with all the data cached in memory, is relatively much faster. But yes, it depends on the data, and this is an average. Two files that actually do differ in the beginning will be faster to compare byte by byte.
Thomas Padron-McCarthy
@Thomas: I can see how that difference could yield a surprising jump in performance. Good point.
Software Monkey
@Thomas: I must admit, I would be thinking more along the lines of 1MB buffers, not 4K.
Software Monkey
+1  A: 

Beyond Compare, sync two folders, super fast! We use it all the time, every day.

bo
+5  A: 

Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

If you have control over the output, have the program creating the files/output create an MD5 on the fly and embed it in the file or output stream, or even pipe the output through a program that creates the MD5 along the way and stores it alongside the data somehow. The point is to do the calculations while the bytes are already in memory.

If you can't pull this off then, like others have said, check file sizes and then do a straight byte-by-byte comparison on same-sized files. I don't see how any sort of binary division or MD5 calculation is any better than a straight comparison: you will have to touch every byte to prove equality any way you cut it, so you might as well cut the amount of computation needed per byte and gain the ability to cut off as soon as you find a mismatch.

The MD5 calculation would be useful if you plan to compare these again later to new outputs, but then you're basically back to my first point of calculating the MD5 as soon as possible.
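As a minimal Java sketch of that first point (computing the digest while the bytes are being written), using DigestOutputStream; the file names and the sample payload are placeholders:

    import java.io.OutputStream;
    import java.nio.file.*;
    import java.security.DigestOutputStream;
    import java.security.MessageDigest;

    // Sketch: wrap the program's output stream so the MD5 is computed while the
    // bytes are still in memory, then store the digest alongside the data.
    public class DigestWhileWriting {

        public static void main(String[] args) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            Path out = Paths.get("run-output.dat");
            try (OutputStream os = new DigestOutputStream(Files.newOutputStream(out), md5)) {
                os.write("very regulated output\n".getBytes("UTF-8"));  // stand-in for the real output
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) hex.append(String.format("%02x", b & 0xff));
            // Later runs only need to compare these small .md5 files, not the data itself.
            Files.write(Paths.get("run-output.dat.md5"), hex.toString().getBytes("UTF-8"));
        }
    }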

Great idea if the output generation can be controlled.
Michael Burr
+3  A: 

Assuming that the expectation is that the files will be the same (it sounds like that's the scenario), then dealing with checksums/hashes is a waste of time - it's likely that they'll be the same and you'd have to re-read the files to get the final proof (I'm also assuming that since you want to "prove ... they are the same", that having them hash to the same value is not good enough).

If that's the case I think that the solution proposed by David is pretty close to what you'd need to do. A couple of things that could be done to optimize the comparison, in increasing level of complexity:

  • check if the file sizes are the same before doing the compare
  • use the fastest memcmp() that you can (comparing words instead of bytes - most C runtimes should do this already; see the sketch after this list)
  • use multiple threads to do the memory block compares (up to the number of processors available on the system; going over that would cause your threads to fight each other)
  • use overlapped/asynchronous I/O to keep the I/O channels as busy as possible, but also profile carefully so you thrash between the files as little as possible (if the files are divided among several different disks and I/O ports, all the better)
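A rough Java-flavoured sketch of the "compare words instead of bytes" point, using memory-mapped buffers so the runtime can compare the mappings wider than one byte at a time; the class name is a placeholder, and this assumes each file fits in a single mapping (under 2 GB):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.*;

    // Sketch: map both files read-only and let the buffer comparison do the work.
    // ByteBuffer.equals compares the remaining contents of the two mappings;
    // sizes are checked first so unequal files are rejected without any reads.
    public class MappedCompare {

        static boolean identical(Path a, Path b) throws IOException {
            if (Files.size(a) != Files.size(b)) return false;
            try (FileChannel ca = FileChannel.open(a, StandardOpenOption.READ);
                 FileChannel cb = FileChannel.open(b, StandardOpenOption.READ)) {
                MappedByteBuffer ma = ca.map(FileChannel.MapMode.READ_ONLY, 0, ca.size());
                MappedByteBuffer mb = cb.map(FileChannel.MapMode.READ_ONLY, 0, cb.size());
                return ma.equals(mb);   // content comparison of the two mappings
            }
        }
    }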
Michael Burr
+4  A: 

Most people in their responses are ignoring the fact that the files must be compared repeatedly. Thus the checksums are faster as the checksum is calculated once and stored in memory (instead of reading the files sequentially n times).

Doug Bennett
Grrr of course you are right...
ojblass
A: 

I have just written a C# app that does something similar to what you want. What my code does is this.

Read all of the sizes of each file into a list or array.

Use a for loop to check if any of these sizes are the same. If they are the same size, compare a byte of one file to a byte of the other file. If the two bytes are the same, move on to the next byte. If a difference is found, return that the files are different.

If the end of both files is reached without finding a difference, the files must be identical.

I have experimented with comparing MD5 hashes of files rather than going through byte for byte, and I have found that identical files are often missed with this method; however, it is significantly faster.

Ryan