I'm hashing a large number of files, and to avoid hash collisions, I'm also storing each file's original size - that way, even if there's a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (i.e., is a hash collision equally likely at any file size), or do I need another piece of information (i.e., is a collision more likely to involve a file of the same length as the original)?

Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?
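
Roughly what I have in mind, as a minimal Python sketch (the `file_key` helper and the MD5 choice are just illustrative):

```python
import hashlib
import os

def file_key(path, algo="md5"):
    """Identify a file by (hash, size): even if two files collide on
    the hash, they would also have to match in size to look identical."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return (h.hexdigest(), os.path.getsize(path))
```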

+1  A: 

It depends on your hash function, but in general, files of the same size but with different content are less likely to produce the same hash than files of different sizes are. Still, it would probably be cleaner to simply use a time-tested hash with a larger space (e.g. MD5 instead of CRC32, or SHA1 instead of MD5) than to bet on a home-grown scheme like storing the file size alongside it.
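
For instance, the digest sizes are easy to compare with Python's hashlib (a quick sketch):

```python
import hashlib

# Digest sizes in bits - a "larger" hash simply has a bigger output space.
for algo in ("md5", "sha1", "sha256"):
    print(algo, hashlib.new(algo).digest_size * 8)
# md5 128, sha1 160, sha256 256
```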

Max Shawabkeh
I was considering using a hash in combination with the file size - that way, in the unlikely event of a collision, I'd check the file size as an additional key to see if it was really the same file.
rwmnau
I understand what you are aiming for, but my point is that instead of spending an extra N bits to store the file size, you should simply pick a hash function whose output is N bits larger than your current one. That is much more likely to produce fewer collisions: file sizes are arbitrarily distributed, while hash functions are specifically designed to avoid collisions, so those extra bits are better utilized as hash bits.
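To put rough numbers on that (a back-of-the-envelope sketch using the birthday approximation; the file count is an arbitrary assumption):

```python
# Birthday approximation: p(collision) ~= n^2 / (2 * 2^bits)
n = 2**40  # number of files - an arbitrary example

p_md5  = n * n / (2 * 2**128)  # MD5 alone: 128 bits
p_sha1 = n * n / (2 * 2**160)  # SHA-1 alone: 160 bits

print(f"MD5:   {p_md5:.1e}")   # ~1.8e-15
print(f"SHA-1: {p_sha1:.1e}")  # ~4.1e-25
# MD5 + a 32-bit size field also occupies 160 bits of key, but file sizes
# are far from uniformly distributed, so those 32 bits buy much less than
# 32 extra hash bits would.
```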
Max Shawabkeh
Ah - that makes sense. I figured I'd be better off picking a "larger" hash function anyway, so maybe that's what I'll end up doing.
rwmnau
A: 

Hash functions are designed so that collisions are very difficult to produce; otherwise they wouldn't be effective. A hash collision is astronomically unlikely - roughly a 1 : number_of_possible_hashes chance - and that likelihood has nothing to do with file size.

If you really want to be doubly sure about hash collisions, you can calculate two different hashes for the same file - that will be less error-prone than saving hash + file size.
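
A minimal sketch of that idea with Python's hashlib (both digests computed in a single pass; the function name is just illustrative):

```python
import hashlib

def double_hash(path):
    """Compute MD5 and SHA-1 in one pass over the file; a collision would
    now have to occur in both functions simultaneously."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()
```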

Li0liQ
I was actually considering doing this - see my other question, http://stackoverflow.com/questions/2437345/tracking-unique-versions-of-files-with-hashes. I figured that saving two hashes (like SHA1 and MD5), as well as the file size, would make collisions so astronomically unlikely that I'd never have to worry about them.
rwmnau
Suppose you are using SHA-256, which gives you 2^256 possible hash values, and you have a billion files with a million versions each: that's 1,000,000,000 * 1,000,000 = 10^15 files, roughly 2^50, so you end up with an average of 2^206 possible hash values per file - no real collision threat. Isn't that immense? To be more precise, you can evaluate the probability of a hash collision exactly as `1 - ((2^256)! / ((2^256) - 10^15)!) / ((2^256)^(10^15))`, or, slightly less accurately, as `1 - (1 - (10^15)/(2*2^256))^(10^15)`, which gives about a `4e-48` chance of collision.
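That figure is easy to reproduce (a sketch; note the exact expression underflows double precision, so this uses the expanded birthday form `n^2 / (2d)`):

```python
n = 10**15  # a billion files times a million versions
d = 2**256  # possible SHA-256 values

# (1 - n/(2d))^n underflows in floating point (n/(2d) is ~2e-63),
# so expand the binomial: p ~= n^2 / (2d).
p = n * n / (2 * d)
print(f"{p:.1e}")  # 4.3e-48
```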
Li0liQ
A: 

The hash is the same size regardless of the size of the original data. Since there are only a limited number of possible hashes, it is theoretically possible for two files of different sizes to have the same hash. However, this also means that two files of the same size can have the same hash.
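
You can see this directly with hashlib (a quick demonstration):

```python
import hashlib

# The digest length is constant no matter how much data goes in.
small = hashlib.sha256(b"x").hexdigest()
large = hashlib.sha256(b"x" * 10_000_000).hexdigest()
print(len(small), len(large))  # 64 64 - always 256 bits / 64 hex chars
```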

Ignacio Vazquez-Abrams
A: 

The whole point of the family of cryptographic hashes (MD5, SHA-x, etc.) is to make collisions vanishingly unlikely. The notion is that official legal processes are prepared to depend on it being impractical to manufacture a collision on purpose. So, really, it's a bad use of space and CPU time to add a belt to the suspenders of these hashes.

bmargulies