views: 165

answers: 4
Is there a reliable way to determine whether or not two files are the same? For example, two files with the same size and type may or may not be the same binarily (yeah, I know it's not really a word). I assume that comparing one or two checksums of the files will help, but I wonder:

  1. How reliable are checksums at determining whether two files are different; what are the chances of two different files having the same checksum?
  2. Would reliability increase by applying additional checksum comparisons?
  3. Which checksum algorithm(s) would be the most efficient and/or reliable?

Any ideas, suggestions or thoughts are appreciated!

P.S. The code for this is being written in Java running on a *nix system, but generic or platform-agnostic input is most helpful.

+4  A: 
1) Very reliable
2) Not theoretically
3) SHA-1
zaf
Shouldn't 2) be "Not in practice" or "Theoretically"? Reliability certainly increases in theory.
IVlad
Ah, you mean he meant having several checksums? Like have a sha1 and md5?
zaf
@zaf: yes, at least I'm hoping he meant that :).
IVlad
@IVlad I suppose "Not theoretically" still applies then ;)
zaf
@IVlad is correct, there is a small chance two files could have the same checksum, and using multiple checksums decreases that probability, so, theoretically that would increase reliability. Checksums are so reliable in practice, however, that this is not necessary. Also, CRC32 makes a better choice for this application: we are not concerned with malicious input, and it is much faster than SHA1.
BlueRaja - Danny Pflughoeft
@BlueRaja: When using CRC32, the likelihood of two random files having the same checksum is 1 in 4,294,967,296 (2^32). When using SHA-1, it is roughly 1 in 1.46 * 10^48 (2^160). If you compared the CRC32 checksums of all the files you will ever see in your life, at least two of them would almost certainly share a checksum (by the birthday problem, a CRC32 collision becomes likely after only about 77,000 files). For SHA-1, you don't live long enough for that to happen ;-) However, the same already holds for MD5, and MD5 is somewhat faster than SHA-1.
Mecki
@BlueRaja: Run a program to delete all duplicates using CRC32 in a huge set of files and you can kiss your data goodbye.
Longpoke
@Mecki and @Longpoke: When doing this in the real world, you use CRC32 to find duplicates; that is what it was made for. Not only is calculating a CRC32 significantly faster than calculating a SHA-1 hash, but comparing two CRC32 outputs (which you will have to do *a lot*) is also *significantly* faster, since a CRC32 fits into a single register on most processors. However, you do not rely solely on the checksum, whether you use CRC32 or SHA-1 or anything else: on the rare occasion that two files have the same checksum, you should still do a byte-by-byte comparison of the files.
BlueRaja - Danny Pflughoeft
@BlueRaja: Could it be that you got stuck in the '90s? ;-) Every application I know of that has been written this century uses at least MD5 for this purpose. Considering that a modern CPU can MD5 500 MB of data a second (using a single core only!), most hard drives cannot even read that much data a second. Also, MD5 has a much better bit distribution than CRC, which means it's much more likely that two files are recognized as different after comparing only the first two bytes of an MD5 checksum than it is for a CRC checksum. A CRC checksum might only differ at the last byte.
Mecki
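
For concreteness: the CRC32 discussed in this thread is available in the JDK as `java.util.zip.CRC32`. Below is a minimal sketch of streaming a file through it (the class and method names other than the JDK ones are illustrative), and, as the comments above stress, a matching CRC32 should still be confirmed with a byte-by-byte comparison before treating two files as duplicates.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class Crc32OfFile {

    // Streams a file through java.util.zip.CRC32; the result fits in a single 32-bit value.
    public static long crc32(String path) throws IOException {
        CRC32 crc = new CRC32();
        InputStream in = new FileInputStream(path);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return crc.getValue();
    }
}
```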
A: 

Any standard checksum algorithm such as MD5 will give you a reliable test for most real-life scenarios. If you need even more reliability, go with SHA. http://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms
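
In Java (which the question mentions), such a checksum can be computed with the standard `java.security.MessageDigest` API; the algorithm name string ("MD5", "SHA-1", "SHA-256", ...) selects the algorithm. A minimal sketch, with class and method names chosen only for illustration:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileChecksum {

    // Streams the file through the digest so even large files need little memory.
    public static byte[] checksum(String path, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        InputStream in = new FileInputStream(path);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        } finally {
            in.close();
        }
        return digest.digest();
    }

    // Unequal digests mean "definitely different"; equal digests mean "almost certainly the same".
    public static boolean sameChecksum(String a, String b, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(checksum(a, algorithm), checksum(b, algorithm));
    }
}
```

Swapping "MD5" for "SHA-1" or "SHA-256" changes nothing else in the code.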

leonbloy
+6  A: 

It's impossible to know with certainty whether or not two files are the same unless you compare them byte for byte. It's similar to how you can't guarantee that a collection does or doesn't contain a given object unless you check every item in the collection.

Checksums are basically a hash. Whether they're good enough for your purposes depends on how mission-critical your app is. It's certainly possible to create a hash function with low risk of collision; after all, passwords are hashed, even in situations where they protect sensitive data and you wouldn't want to have a second valid password on your account. Unless you're writing code for, say, a bank, a strong checksum algorithm should provide a very good approximation.

Using multiple checksums will increase reliability if and only if the different checksum algorithms use dissimilar hash functions.

Your third question has already been taken care of by leonbloy's answer; MD5 and SHA-1 are common.
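
For the "byte for byte" comparison that does give certainty, here is a self-contained Java sketch (the class and method names are illustrative; checking the lengths first is just a cheap shortcut):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ExactCompare {

    // The only fully certain test: compare lengths, then every byte.
    public static boolean sameBytes(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;                 // different sizes can never be identical
        }
        InputStream inA = new BufferedInputStream(new FileInputStream(a));
        InputStream inB = new BufferedInputStream(new FileInputStream(b));
        try {
            int x, y;
            do {
                x = inA.read();
                y = inB.read();
                if (x != y) {
                    return false;         // first differing byte found
                }
            } while (x != -1);            // both streams ended together
            return true;
        } finally {
            inA.close();
            inB.close();
        }
    }
}
```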

Lord Torgamus
-1 for clear confusion between hash and checksum
BlueRaja - Danny Pflughoeft
@BlueRaja, how so?
Lord Torgamus
`Checksums are basically a hash.` It's the other way around - hashes are basically checksums, but with more stringent requirements. `It's certainly possible to create a hash function with low risk of collision` Hashes are designed to have as low a risk of collision as statistically possible. Anything else is simply not a hash. `a strong checksum algorithm should provide a very good approximation [of a hash]` Hashes and checksums are similar beasts with very different purposes. CRC32 is a great checksum, but a lousy hash. BCrypt is a great hash, but a lousy checksum (it is too slow).
BlueRaja - Danny Pflughoeft
+1 to balance BlueRaja's "clear confusion." If you think about it, a checksum function and a hash function are the same; the only difference is how you use the result.
Ukko
No fair, BlueRaja's explanation was not there 2 minutes ago when I started the previous comment. Now, in response: your criticism is not about what a checksum or a hash essentially is. Rather, you are saying that only good hash functions are hashes and good checksums are checksums.
Ukko
@BlueRaja, we're getting into semantics. It's perfectly legal, though stupid, to override Java's `hashCode()` method to return the same value for every object. For the purposes of this question, I think it's reasonable to define a hash as a manipulation of the input data into a relatively small, likely unique result. Similar to Ukko's comment.
Lord Torgamus
@Lord: Ah, I see, we are using two separate definitions of "hash" - I was talking about [cryptographic hashes](http://en.wikipedia.org/wiki/Cryptographic_hash_function) (which has become the common meaning of the term "hash"), whereas you were talking about the ["hash codes"](http://en.wikipedia.org/wiki/Hash_function) used by hash tables (et al.). I am so used to discussing cryptographic hashes (other answers mentioned SHA-1 and MD5, for example) that I must have lost my head. If you edit your answer I will remove the downvote.
BlueRaja - Danny Pflughoeft
@BlueRaja, edited. I was pretty confused at first, because I have no crypto background; it's interesting reading you semi-directly led me to. Thanks!
Lord Torgamus
A: 

Any checksum will give you a false positive for a very small number of cases. If you can live with that, fine. If not, then the way to do this is to do the checksum comparison first, and if the checksums are equal, do a byte-by-byte test. The byte-by-byte test will be needed very rarely, so the cost averaged over a lot of comparisons will be very small. HOWEVER, this is not the case when most of your comparisons are expected to return 'true'.

It also depends on how many different files you are testing. Calculating a high-reliability checksum is nearly as expensive as doing a full comparison - if each file gets compared approximately once, it may be cheaper just to do the comparisons.
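
As a sketch of that trade-off (not DJClayworth's code; the MD5 choice, the key format, and the helper names are my own assumptions), one common arrangement when scanning many files is to hash each file exactly once and group by (size, checksum), so only the rare groups with more than one member need a byte-by-byte verification:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateCandidates {

    // Key files by (size, checksum) so each file is hashed exactly once,
    // no matter how many other files it ends up being compared against.
    public static Map<String, List<File>> groupBySizeAndChecksum(List<File> files)
            throws IOException, NoSuchAlgorithmException {
        Map<String, List<File>> groups = new HashMap<String, List<File>>();
        for (File f : files) {
            String key = f.length() + ":" + hex(checksum(f, "MD5"));
            List<File> group = groups.get(key);
            if (group == null) {
                group = new ArrayList<File>();
                groups.put(key, group);
            }
            group.add(f);
        }
        // Groups with more than one entry are duplicate *candidates*;
        // confirm them with a byte-by-byte comparison before acting on them.
        return groups;
    }

    static byte[] checksum(File f, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        InputStream in = new FileInputStream(f);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return md.digest();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```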

DJClayworth