views: 111
answers: 4
Comparing byte by byte certainly works, but I am wondering whether there are any other proven ways, say some kind of hashing that outputs unique values for each file. And if there are, what are the advantages and disadvantages of each in terms of time and memory footprint?

By the way, I found this previous thread: http://stackoverflow.com/questions/784585/what-is-the-fastest-way-to-check-if-files-are-identical. However, my question is not about speed, but about alternatives.

Please advise. Thanks.

+2  A: 

In any case, if your files are n bytes long, you have to compare n bytes; you can't make the problem any simpler.

You can only avoid the full n comparisons when the files are not identical, for example by checking their lengths first.

A hash is not a proven method because of collisions, and to compute a hash you'll have to read all n bytes of each file as well.

If you want to compare the same file multiple times you can use hashing, then double-check with a byte-by-byte comparison.
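For illustration, a minimal sketch of that approach in Python (the language and the chunk size are my own assumptions, not part of the answer): check the lengths first, then compare the contents chunk by chunk.

    import os

    def files_identical(path_a, path_b, chunk_size=64 * 1024):
        """Return True only if both files contain exactly the same bytes."""
        # Cheap negative test first: files of different length can't be identical.
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                chunk_a = fa.read(chunk_size)
                chunk_b = fb.read(chunk_size)
                if chunk_a != chunk_b:
                    return False
                if not chunk_a:  # both files exhausted without a mismatch
                    return True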

Lliane
+1  A: 

Hashing doesn't output 'unique' values. It can't possibly do so, because there are an infinite number of different files, but only a finite number of hash values. It doesn't take much thought to realise that to be absolutely sure two files are the same, you're going to have to examine all the bytes of both of them.

Hashes and checksums can provide a fast 'these files are different' answer, and within certain probabilistic bounds can provide a fast 'these files are probably the same' answer, but for certainty of equality you have to check every byte. How could there be any way round this?
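As a toy illustration of the pigeonhole point (my own example, not from the answer): squeeze 300 distinct inputs into an 8-bit value and a collision is guaranteed, because only 256 outputs exist.

    import hashlib

    # Deliberately tiny 8-bit "hash": the first byte of a SHA-256 digest.
    # With 300 distinct inputs and only 256 possible outputs, at least two
    # inputs must share a value -- the pigeonhole principle in action.
    seen = {}
    for i in range(300):
        data = f"file-contents-{i}".encode()
        h = hashlib.sha256(data).digest()[0]
        if h in seen:
            print(f"collision: {seen[h]!r} and {data!r} both map to {h}")
            break
        seen[h] = data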

AakashM
Downvoters please leave an explanatory comment.
AakashM
A: 

If you want to compare multiple files, then the SHA-1 hash algorithm is a very good choice.
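A sketch of how that might look (Python; the helper names and chunk size are my own assumptions): hash each file once by streaming it, then group files whose SHA-1 digests match as candidate duplicates.

    import hashlib
    from collections import defaultdict

    def sha1_of_file(path, chunk_size=64 * 1024):
        """Stream the file so memory use stays at one chunk."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def candidate_duplicates(paths):
        """Group files whose SHA-1 digests match; these are probable duplicates."""
        groups = defaultdict(list)
        for path in paths:
            groups[sha1_of_file(path)].append(path)
        return {digest: ps for digest, ps in groups.items() if len(ps) > 1}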

Nick D
+3  A: 

The only proven way is to do a byte-by-byte compare. It's also the fastest way and you can cut the memory usage all the way down to 2 bytes if you read a byte at a time. Reading larger chunks at a time is beneficial for performance though.

Hashing will also work. Due to the pigeonhole principle there is a small chance of false positives, but for all intents and purposes it is negligible if you use a secure hash like SHA. Memory usage is also small, but performance is worse than a byte-by-byte compare because of the hashing overhead, unless you can reuse the hashes to do multiple compares.
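A rough sketch of the hash-reuse case (Python; the in-memory cache and the choice of SHA-256 are my own assumptions): each file is hashed at most once, so comparing one file against many others reads it only once.

    import hashlib

    _digest_cache = {}

    def cached_digest(path, chunk_size=64 * 1024):
        """Hash the file in chunks and remember the result for later compares."""
        if path not in _digest_cache:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            _digest_cache[path] = h.digest()
        return _digest_cache[path]

    def probably_identical(path_a, path_b):
        """Unequal digests: definitely different. Equal digests: identical for
        all practical purposes (see the collision discussion above)."""
        return cached_digest(path_a) == cached_digest(path_b)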

Ants Aasma
In particular, once you get the probability of a false match well below the probability of a cosmic ray flipping a bit in RAM and causing a false match for the byte-by-byte comparison, you can stop worrying about it. The real risk is not chance, but that your hash has been broken and someone maliciously introduces hash collisions into your data set. So definitely don't use MD5 as the hash, and with a view to the near future, don't use SHA-1 either.
Steve Jessop
Do you know of any estimates for the cosmic-ray probability? I imagine that only a relatively small number of bit locations would have that effect, presenting a correspondingly tiny target for a cosmic ray to hit.
caf
Sorry, no, I confess I used it as hyperbole; the point is that there's no need to demand that software be many, many orders of magnitude more reliable than hardware. Suppose the chance of a hardware-provoked error happening to some computer, somewhere, at some point in the next 1000 years is greater than 1%. I think that's exceedingly generous to hardware. With, say, 10^12 computers running for 1000 years, that requires a miscalculation probability of at least 1 in 2^91 per millisecond. I think Intel would settle for that, and the probability of a pair colliding on a 160-bit hash is of course much lower.
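For what it's worth, a quick back-of-the-envelope check of those figures (all inputs are the hypothetical numbers from this comment, not measurements):

    # 10^12 computers running for 1000 years with a 1% total chance of error:
    # what per-millisecond error probability does that imply?
    ms_per_1000_years = 1000 * 365.25 * 24 * 3600 * 1000  # about 3.2e13 ms
    computer_ms = 1e12 * ms_per_1000_years                # about 3.2e25 computer-ms
    per_ms_probability = 0.01 / computer_ms               # about 3.2e-28
    print(per_ms_probability, 2.0 ** -91)                 # 2^-91 is about 4.0e-28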
Steve Jessop
One single-bit error per month per GB of memory is about the rule of thumb. That's why servers use ECC memory.
Ants Aasma