I'm hashing a large number of files, and to avoid hash collisions, I'm also storing each file's original size - that way, even if there's a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (i.e., is a hash collision equally likely at any file size), or do I need another piece of information (i.e., is a collision more likely to involve a file of the same length as the original)?

Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?
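
Roughly what I have in mind, as a minimal Python sketch (the `file_key` helper and the MD5 choice are just illustrative):

```python
import hashlib
import os

def file_key(path, algo="md5"):
    """Identify a file by (hash, size): even if two files collide on
    the hash, they would also have to match in size to look identical."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return (h.hexdigest(), os.path.getsize(path))
```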

+1  A: 

It depends on your hash function, but in general, files of the same size but with different content are less likely to produce the same hash than files of different sizes are. Still, it would probably be cleaner to simply use a time-tested hash with a larger space (e.g. MD5 instead of CRC32, or SHA1 instead of MD5) than to bet on a home-grown scheme like storing the file size alongside it.
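
For instance, the digest sizes are easy to compare with Python's hashlib (a quick sketch):

```python
import hashlib

# Digest sizes in bits - a "larger" hash simply has a bigger output space.
for algo in ("md5", "sha1", "sha256"):
    print(algo, hashlib.new(algo).digest_size * 8)
# md5 128, sha1 160, sha256 256
```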

Max Shawabkeh
I was considering using a hash in combination with the file size - that way, in the unlikely event of a collision, I'd check the file size as an additional key to see if it was really the same file.
rwmnau
I understand what you are aiming for, but my point is that instead of spending an extra N bits to store the file size, you should simply pick a hash function whose output is N bits larger than your current one. That is much more likely to produce fewer collisions: file sizes are arbitrarily distributed, while hash functions are specifically designed to avoid collisions, so those extra bits are better utilized as hash bits.
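To put rough numbers on that (a back-of-the-envelope sketch using the birthday approximation; the file count is an arbitrary assumption):

```python
# Birthday approximation: p(collision) ~= n^2 / (2 * 2^bits)
n = 2**40  # number of files - an arbitrary example

p_md5  = n * n / (2 * 2**128)  # MD5 alone: 128 bits
p_sha1 = n * n / (2 * 2**160)  # SHA-1 alone: 160 bits

print(f"MD5:   {p_md5:.1e}")   # ~1.8e-15
print(f"SHA-1: {p_sha1:.1e}")  # ~4.1e-25
# MD5 + a 32-bit size field also occupies 160 bits of key, but file sizes
# are far from uniformly distributed, so those 32 bits buy much less than
# 32 extra hash bits would.
```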
Max Shawabkeh
Ah - that makes sense. I figured I'd be better off picking a "larger" hash function anyway, so maybe that's what I'll end up doing.
rwmnau
A: 

Hash functions are designed so that collisions are very difficult to produce; otherwise they wouldn't be effective. A hash collision is astronomically unlikely - roughly a 1 : number_of_possible_hashes chance - and that likelihood has nothing to do with file size.

If you really want to be doubly sure about hash collisions, you can calculate two different hashes for the same file - that will be less error-prone than saving hash + file size.
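
A minimal sketch of that idea with Python's hashlib (both digests computed in a single pass; the function name is just illustrative):

```python
import hashlib

def double_hash(path):
    """Compute MD5 and SHA-1 in one pass over the file; a collision would
    now have to occur in both functions simultaneously."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```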

Li0liQ
I was actually considering doing this - see my other question, http://stackoverflow.com/questions/2437345/tracking-unique-versions-of-files-with-hashes. I figured that saving two hashes (like SHA1 and MD5), as well as the file size, would make collisions so astronomically unlikely that I'd never have to worry about them.
rwmnau
Suppose you are using SHA-256, which gives you 2^256 possible hash values, and you have a billion files with a million versions each: that's 1,000,000,000 * 1,000,000 = 10^15 files, roughly 2^50, so you end up with an average of 2^206 possible hash values per file - no real collision threat. Isn't that immense? To be more precise, you can evaluate the probability of a hash collision exactly as `1 - ((2^256)! / ((2^256) - 10^15)!) / ((2^256)^(10^15))`, or, slightly less accurately, as `1 - (1 - (10^15)/(2*2^256))^(10^15)`, which gives about a `4e-48` chance of collision.
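That figure is easy to reproduce (a sketch; note the exact expression underflows double precision, so this uses the expanded birthday form `n^2 / (2d)`):

```python
n = 10**15  # a billion files times a million versions
d = 2**256  # possible SHA-256 values

# (1 - n/(2d))^n underflows in floating point (n/(2d) is ~2e-63),
# so expand the binomial: p ~= n^2 / (2d).
p = n * n / (2 * d)
print(f"{p:.1e}")  # 4.3e-48
```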
Li0liQ
A: 

The hash is the same size regardless of the size of the original data. Since there are only a limited number of possible hashes, it is theoretically possible for two files of different sizes to have the same hash. However, this also means that two files of the same size can have the same hash.
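
You can see this directly with hashlib (a quick demonstration):

```python
import hashlib

# The digest length is constant no matter how much data goes in.
small = hashlib.sha256(b"x").hexdigest()
large = hashlib.sha256(b"x" * 10_000_000).hexdigest()
print(len(small), len(large))  # 64 64 - always 256 bits / 64 hex chars
```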

Ignacio Vazquez-Abrams
A: 

The whole point of the family of cryptographic hashes (MD5, SHA-x, etc.) is to make collisions vanishingly unlikely. The notion is that official legal processes are prepared to depend on it being impractical to manufacture a collision on purpose. So, really, it's a bad use of space and CPU time to add a belt to the suspenders of these hashes.

bmargulies