views:

242

answers:

5

This is basically a math problem, but very programing related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?

How does the answer change if I only have 100 million URLs?

It seems to me that collisions will be extremely rare, but these things tend to be confusing.

Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.

EDIT: not quite a duplicate

+1  A: 

You have tagged this as "birthday-paradox", I think you know the answer already.

P(Collision) = 1 - (2^64)!/((2^64)^n (1 - n)!)

where n is 1 billion in your case.

You will be a bit better using something other then MD5, because MD5 have pratical collusion problem.

J-16 SDiZ
+3  A: 

If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd get collisions for every 2^32 URL's. In other words, the probability of a collision is the number of URL's divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.

I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.

Steven Sudit
So, uh, do you mean 2^64 (18,446,744,073,709,551,616) where you said 2^32, above? The question talks about 64 bits, but not 32.
unwind
No, he means 2^32. That means that for 100M urls there is less than 1% chance of 1 collision. I think I'll take it.
itsadok
That's correct, itsadok, I do mean 2^32, not 2^64. That's the whole point of the birthday paradox: the chance of any two random values matching each other is counterintuitively much higher than the chance of any one random value matching a single target
Steven Sudit
+1  A: 

From what I see, you need a hash function with the following requirements,

  1. Hash arbitrary length strings to a 64-bit value
  2. Be good -- Avoid collisions
  3. Not necessarily one-way (security not required)
  4. Preferably fast -- which is a necessary characteristic for a non-security application

This hash function survey may be useful for drilling down to the function most suitable for you.
I will suggest trying out multiple functions from here and characterizing them for your likely input set (pick a few billion URL that you think you will see).

You can actually generate another column like this test survey for your test URL list to characterize and select from the existing or any new hash functions (more rows in that table) that you might want to check. They have MSVC++ source code to start with (reference to ZIP link).

Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.

nik
+1  A: 

Just by using a hash, there is always a chance of collisions. And you don't know beforehand wether collisions will happen once or twice, or even hundreds or thousands of times in your list of urls.

The probability is still just a probability. Its like throwing a dice 10 or 100 times, what are the chances of getting all sixes? The probability says it is low, but it still can happen. Maybe even many times in a row...

So while the birthday paradox shows you how to calculate the probabilities, you still need to decide if collisions are acceptable or not.

...and collisions are acceptable, and hashes are still the right way to go; find a 64 bit hashing algorithm instead of relying on "half-a-MD5" having a good distribution. (Though it probably has...)

Arjan Einbu
A: 

If you have 2^n hash possibilities, there's over a 50% chance of collision when you have 2^(n/2) items.

E.G. if your hash is 64 bits, you have 2^64 hash possibilities, you'd have a 50% chance of collision if you have 2^32 items in a collection.