views:

169

answers:

2

I have a part of my application that stores files. Because we could potentially be adding many of the same file, I am first keeping a hash of each file. If two files have the same hash, then we throw out one, and both "references" to that file point to the same physical file.

  1. How much should I be worried about hash collisions?

  2. In the case of a collision what should I do? The whole crux of my code so far depends on there not being two different files with the same hash. In the event of a collision right now, my app would throw out a legitmately different file and point to the file with the same hash.

  3. Should I be using something other than MD5? Does SHA-1 have a better collision rate?

+3  A: 

Unless you're in some really REALLY critical application, do not worry about hash collisions. They are so rare that many things assume they are not going to happen, and catastrophic things will happen to these things if that assumption ends up being false just once.

SHA1 has a larger output space than MD5 (and fewer attacks are known on it, too), so it's definitely not a worse choice. If you are afraid of someone actively colliding your hashes, perhaps a later variant of SHA, such as SHA-256, might be a good idea.

Jan Krüger
An SHA-1 collision is likely to be demonstrated quite soon (http://www.schneier.com/blog/archives/2009/06/ever_better_cry.html). SHA-256 is safer, and make sure the code is designed to let you move to a new hash easily in future.
Steve Jessop
+2  A: 

The chance of a collision between the hashes of any two randomly selected bitstreams is the inversely proportional to the number of distinct states that the hash represents. So a 64 bit hash encodes 2 ** 64 states and has a chance of 1 / (2**64) of a collision for any pair of files. But you are really concerned with the chance of a collisions over a (large) set of files, so you need to do the "birthday paradox" calculation, plugging the probability of a pairwise collision and the expected number of files.

But I think that the bottom line is that throwing away a file without doing a comparison is an unsafe thing to do, even if the numbers say that the probability of a collision is small.

Stephen C
SHA-1 is a 160 bit hash. There are other failure modes which occur with greater probability in the case where one is storing 2^80 files ;-) The genuine danger is someone maliciously supplying two colliding files, or coercing the program into generating two colliding files.
Steve Jessop