I want to truncate an md5 hash to about half size. How much does that increase the odds of collisions? if I'm dealing with around 500 000 generations, should I be worried about a collision? what about 1m generations.
views:
178answers:
2There's an interesting mathematical problem called the birthday problem that deals with that kind of situation. The fact is that the more entries you push in, the higher the chances to have a collision.
Following the table posted on the above link, assuming your digests are 64 bits each (since a single MD5 hash is 128 bits) and that MD5 have a uniform distribution, there is a very low chance that two hashes will collide. It becomes significant (1% chance or more) at 610,000,000 entries.
The math you're looking for is on Wikipedia's birthday attack page.
We consider the following experiment. From a set of H values we choose n values uniformly at random thereby allowing repetitions. Let p(n; H) be the probability that during this experiment at least one value is chosen more than once. This probability can be approximated as
With 128 bits the chance of a collision among 500,000 hash values is around 10-28. If you halve the size of the collision space then the chance of collision is around 10-9. That is, even though the chance is vastly greater it's still very, very low. It depends on how critical it is that there be no collisions. 10-9 is on the order of one in a billion, so while extremely unlikely it's within the realm of possibility.
For reference:
1028 = 10 octillion = 10 billion billion billion
109 = 1 billion