views:

98

answers:

5

If you only use the first 4 bytes of an MD5 hash, would that mean theoretically only 1 in 255^4 chance of collision? That is, are hashes designed such that you only have to use a small portion of the returned hash (say the hash is of a file of some size)?

+1  A: 

Assume we have a pre-determined message1. hash1 = md5(message1)

Now choose a message2 randomly, and set hash2 = md5(message2).

In theory there is a 1/255^4 chance that the first four characters of hash2 match the first four of pre-determined hash1.

It is also supposed to be very hard for an attacker that knows message1 to come up with a different message2 that has the same hash. This is called second pre-image resistance. However, even with the full MD5, there are better than theoretical pre-image attacks.

MD5 is completely broken for collisions. This means it is quite feasible for an attacker (in a few hours) to come up with two messages with the same hash (let alone the same first four bytes). The attacker gets to choose both messages, but this can still cause major damage. See for instance the poisoned message example.

Matthew Flaschen
thanks, thats what I assumed, so was wondering what the point of 16 bytes was (in the case of md5) I'm assuming maybe the extra bytes are just an artefact of the algorithm, and not something strictly necessary. How many digits should one use to be safe.
Mark
Well, the more digits there are the less collisions are necessary; why would you truncate the hash?
Michael Mrozek
255^4 is not actually all that big a number in many hashing applications. It's only about 4 billion; there are many applications which use hashes for record counts in the millions if not billions, and even an aggregate ~1/1000 chance of a collision is a bad thing.
Amber
I would truncate because I need an alphanumeric filename of reasonable length, and 16 bytes would correspond to 2-3 times that many valid characters.
Mark
Just for the record the hash only involves generating a unique filename for an image file, where two identical files are given the same name, there's no issue of security. Now it seems that 4 bytes might be more than sufficient.
Mark
32 characters is too long for a filename? Git uses hashes as filenames all over the place
Michael Mrozek
the user will in fact see the filenames, and I don't want something completely outlandish.
Mark
A: 

Depends on the purpose of the hash.

Hash functions for use in hash tables tend to have more "randomness" in the lower bits (which are used to find the array index) than in the higher bits. Checksum and cryptographic hash functions are more evenly distributed.

dan04
Well, I'm using windows' md5 hash. When you say the lower bits, would you mean lower index in the returned byte array.
Mark
I take it that some sort of obvious similarity between two strings, won't make their hashes more susceptable to collision (as long as they are in fact different).
Mark
Not unless the hashing algorithm lacks [diffusion](http://en.wikipedia.org/wiki/Confusion_and_diffusion). See the [avalanche effect](http://en.wikipedia.org/wiki/Avalanche_effect)
Michael Mrozek
+5  A: 

Remember that, even without considering a smart attacker deliberately trying to cause collisions, you need to start worrying about accidental collisions once the number of objects you're hashing get comparable to the square root of the hash space... just a few tens of thousands of objects for a 32-bit hash key. This comes from the so-called birthday paradox.

Alex Martelli
Sort of thing I need to know, thanks. Looks like I should bump it up to 8 bytes then.
Mark
@Mark: Even with 8 bytes you will still (with 50% probability) get a collision after 2^32 = 4 million hashes
BlueRaja - Danny Pflughoeft
2^32 is 4 billion
Mark
+2  A: 

It is 256, not 255.

Assuming that MD5 is a secure hash function (it turns out it is not secure, but, for the sake of the discussion, let's suppose that it is secure), then it should behave like a random oracle, a mythical object which outputs uniformly random values, under the sole constraint that it "remembers" its previous outputs and returns the same value again, given the same input.

Truncating the output of a random oracle yields another random oracle. Thus, if you keep 32 bits, then the probability of a collision with two distinct input messages is 1 in 2^32 (i.e. 1 in 256^4).

Now there is a thing known as the birthday paradox which says that, with about 2^16 distinct inputs, there are good chances that two of the 2^16 corresponding outputs collide.

MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions. The current default recommendation is SHA-2 (a family of four functions, with output sizes 224, 256, 384 and 512 bits, respectively). A new (american) standard is currently being defined, through an open competition, under the code name SHA-3. This is a long process; the new function shall be chosen by mid-2012. Some of the remaining candidates (currently 14, out of an initial 51) are substantially faster than SHA-2, some approaching MD5 in performance, while being considerably more secure. But this is a bit new, so right now you shall use SHA-2 by default.

Thomas Pornin
"MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions." The way I've been reading it is, md5 is *not* a concern regarding accidental collisions - only intentional (security-related) ones.
Mark
+1  A: 

If you're generating unique identifiers, you might want to use a UUID instead. These are designed to minimize the change of collisions so that in practice they should never occur.

If you're worried about filenames being too long, which is a peculiar thing to be concerned about when most operating systems support names as long as 255 characters, you can always split the filename into a path and filename component. This has the advantage of splitting up the files into different directories:

fdadda221fd71619e6c0139730b012577dd4de90

fdadda221fd71619e6c/0139730b012577dd4de90

fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90
tadman
Couldn't find any platform sdk functions for UUID. Am currently using CryptCreateHash - works fine.
Mark