views: 165

answers: 4
Is there a reliable way to determine whether or not two files are the same? For example, two files with the same size and type may or may not be the same binarily (yeah, I know it's not really a word). I assume that comparing one or two checksums of the files will help, but I wonder:

  1. How reliable are checksums at determining whether two files are different; what are the chances of two different files having the same checksum?
  2. Would reliability increase by applying additional checksum comparisons?
  3. Which checksum algorithm(s) would be the most efficient and/or reliable?

Any ideas, suggestions or thoughts are appreciated!

P.S. The code for this is being written in Java running on a *nix system, but generic or platform-agnostic input is most helpful.

+4  A: 
1) Very reliable
2) Not theoretically
3) SHA-1
zaf
Shouldn't 2) be "Not in practice" or "Theoretically"? Reliability certainly increases in theory.
IVlad
Ah, you mean he meant having several checksums? Like have a sha1 and md5?
zaf
@zaf: yes, at least I'm hoping he meant that :).
IVlad
@IVlad I suppose "Not theoretically" still applies then ;)
zaf
@IVlad is correct, there is a small chance two files could have the same checksum, and using multiple checksums decreases that probability, so, theoretically that would increase reliability. Checksums are so reliable in practice, however, that this is not necessary. Also, CRC32 makes a better choice for this application: we are not concerned with malicious input, and it is much faster than SHA1.
BlueRaja - Danny Pflughoeft
@BlueRaja: When using CRC32, the likelihood of two random files having the same checksum is 1 in 4,294,967,296 (2^32). When using SHA-1, it is roughly 1 in 1.46 * 10^48 (2^160). If you compared the CRC32 checksums of all the files you will ever see in your life, at least two of them would almost certainly share a checksum (by the birthday problem, a CRC32 collision becomes likely after only about 77,000 files). For SHA-1, you don't live long enough for that to happen ;-) However, the same already holds for MD5, and MD5 is somewhat faster than SHA-1.
Mecki
@BlueRaja: Run a program to delete all duplicates using CRC32 in a huge set of files and you can kiss your data goodbye.
Longpoke
@Mecki and @Longpoke: When doing this in the real world, you use CRC32 to find duplicates; that is what it was made for. Not only is calculating a CRC32 significantly faster than calculating a SHA-1 hash, but comparing two CRC32 outputs (which you will have to do *a lot*) is also *significantly* faster, since a CRC32 fits into a single register on most processors. However, you do not rely solely on the checksum, whether you use CRC32 or SHA-1 or anything else: on the rare occasion that two files have the same checksum, you should still do a byte-by-byte comparison of the files.
BlueRaja - Danny Pflughoeft
@BlueRaja: Could it be that you got stuck in the '90s? ;-) Every application I know of that has been written this century uses at least MD5 for this purpose. Considering that a modern CPU can MD5 500 MB of data a second (using a single core only!), most hard drives cannot even read that much data a second. Also, MD5 has a much better bit distribution than CRC, which means it's much more likely that two files are recognized as different after comparing only the first two bytes of an MD5 checksum than it is for a CRC checksum. A CRC checksum might only differ at the last byte.
Mecki
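
For concreteness: the CRC32 discussed in this thread is available in the JDK as `java.util.zip.CRC32`. Below is a minimal sketch of streaming a file through it (the class and method names other than the JDK ones are illustrative), and, as the comments above stress, a matching CRC32 should still be confirmed with a byte-by-byte comparison before treating two files as duplicates.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;

public class Crc32OfFile {

    // Streams a file through java.util.zip.CRC32; the result fits in a single 32-bit value.
    public static long crc32(String path) throws IOException {
        CRC32 crc = new CRC32();
        InputStream in = new FileInputStream(path);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return crc.getValue();
    }
}
```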
A: 

Any standard checksum algorithm such as MD5 will give you a reliable test for most real-life scenarios. If you need even more reliability, go with SHA. http://en.wikipedia.org/wiki/Cryptographic_hash_function#Cryptographic_hash_algorithms
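
In Java (which the question mentions), such a checksum can be computed with the standard `java.security.MessageDigest` API; the algorithm name string ("MD5", "SHA-1", "SHA-256", ...) selects the algorithm. A minimal sketch, with class and method names chosen only for illustration:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileChecksum {

    // Streams the file through the digest so even large files need little memory.
    public static byte[] checksum(String path, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        InputStream in = new FileInputStream(path);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        } finally {
            in.close();
        }
        return digest.digest();
    }

    // Unequal digests mean "definitely different"; equal digests mean "almost certainly the same".
    public static boolean sameChecksum(String a, String b, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        return MessageDigest.isEqual(checksum(a, algorithm), checksum(b, algorithm));
    }
}
```

Swapping "MD5" for "SHA-1" or "SHA-256" changes nothing else in the code.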

leonbloy
+6  A: 

It's impossible to know with certainty whether or not two files are the same unless you compare them byte for byte. It's similar to how you can't guarantee that a collection does or doesn't contain a given object unless you check every item in the collection.

Checksums are basically a hash. Whether they're good enough for your purposes depends on how mission-critical your app is. It's certainly possible to create a hash function with low risk of collision; after all, passwords are hashed, even in situations where they protect sensitive data and you wouldn't want to have a second valid password on your account. Unless you're writing code for, say, a bank, a strong checksum algorithm should provide a very good approximation.

Using multiple checksums will increase reliability if and only if the different checksum algorithms use dissimilar hash functions.

Your third question has already been taken care of by leonbloy's answer; MD5 and SHA-1 are common.
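
For the "byte for byte" comparison that does give certainty, here is a self-contained Java sketch (the class and method names are illustrative; checking the lengths first is just a cheap shortcut):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ExactCompare {

    // The only fully certain test: compare lengths, then every byte.
    public static boolean sameBytes(File a, File b) throws IOException {
        if (a.length() != b.length()) {
            return false;                 // different sizes can never be identical
        }
        InputStream inA = new BufferedInputStream(new FileInputStream(a));
        InputStream inB = new BufferedInputStream(new FileInputStream(b));
        try {
            int x, y;
            do {
                x = inA.read();
                y = inB.read();
                if (x != y) {
                    return false;         // first differing byte found
                }
            } while (x != -1);            // both streams ended together
            return true;
        } finally {
            inA.close();
            inB.close();
        }
    }
}
```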

Lord Torgamus
-1 for clear confusion between hash and checksum
BlueRaja - Danny Pflughoeft
@BlueRaja, how so?
Lord Torgamus
`Checksums are basically a hash.` It's the other way around - hashes are basically checksums, but with more stringent requirements. `It's certainly possible to create a hash function with low risk of collision` Hashes are designed to have as low a risk of collision as statistically possible. Anything else is simply not a hash. `a strong checksum algorithm should provide a very good approximation [of a hash]` Hashes and checksums are similar beasts with very different purposes. CRC32 is a great checksum, but a lousy hash. BCrypt is a great hash, but a lousy checksum (it is too slow).
BlueRaja - Danny Pflughoeft
+1 to balance BlueRaja's "clear confusion." If you think about it, a checksum function and a hash function are the same; the only difference is how you use the result.
Ukko
No fair, BlueRaja's explanation was not there 2 minutes ago when I started the previous comment. Now, in response: your criticism is not about what a checksum or a hash essentially is. Rather, you are saying that only good hash functions are hashes and good checksums are checksums.
Ukko
@BlueRaja, we're getting into semantics. It's perfectly legal, though stupid, to override Java's `hashCode()` method to return the same value for every object. For the purposes of this question, I think it's reasonable to define a hash as a manipulation of the input data into a relatively small, likely unique result. Similar to Ukko's comment.
Lord Torgamus
@Lord: Ah, I see, we are using two separate definitions of "hash" - I was talking about [cryptographic hashes](http://en.wikipedia.org/wiki/Cryptographic_hash_function) (which has become the common meaning of the term "hash"), whereas you were talking about the ["hash codes"](http://en.wikipedia.org/wiki/Hash_function) used by hash tables (et al.). I am so used to discussing cryptographic hashes (other answers mentioned SHA-1 and MD5, for example) that I must have lost my head. If you edit your answer I will remove the downvote.
BlueRaja - Danny Pflughoeft
@BlueRaja, edited. I was pretty confused at first, because I have no crypto background; it's interesting reading you semi-directly led me to. Thanks!
Lord Torgamus
A: 

Any checksum will give you a false positive for a very small number of cases. If you can live with that, fine. If not, then the way to do this is to do the checksum comparison first, and if the checksums are equal, do a byte-by-byte test. The byte-by-byte test will be needed very rarely, so the cost averaged over a lot of comparisons will be very small. HOWEVER, this is not the case when most of your comparisons are expected to return 'true'.

It also depends on how many different files you are testing. Calculating a high-reliability checksum is nearly as expensive as doing a full comparison - if each file gets compared approximately once, it may be cheaper just to do the comparisons.
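
As a sketch of that trade-off (not DJClayworth's code; the MD5 choice, the key format, and the helper names are my own assumptions), one common arrangement when scanning many files is to hash each file exactly once and group by (size, checksum), so only the rare groups with more than one member need a byte-by-byte verification:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateCandidates {

    // Key files by (size, checksum) so each file is hashed exactly once,
    // no matter how many other files it ends up being compared against.
    public static Map<String, List<File>> groupBySizeAndChecksum(List<File> files)
            throws IOException, NoSuchAlgorithmException {
        Map<String, List<File>> groups = new HashMap<String, List<File>>();
        for (File f : files) {
            String key = f.length() + ":" + hex(checksum(f, "MD5"));
            List<File> group = groups.get(key);
            if (group == null) {
                group = new ArrayList<File>();
                groups.put(key, group);
            }
            group.add(f);
        }
        // Groups with more than one entry are duplicate *candidates*;
        // confirm them with a byte-by-byte comparison before acting on them.
        return groups;
    }

    static byte[] checksum(File f, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        InputStream in = new FileInputStream(f);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        return md.digest();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```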

DJClayworth