views:

638

answers:

4

Say I have a hash algorithm, and it's nice and smooth (The odds of any one hash value coming up are the same as any other value).

Now say that I know that the odds of picking 2 hashes and there being a collision are (For arguments sake) 50000:1.

Now say I pick 100 hashes. How do I calculate the odds of a collision within that set of 100 values, given the odds of a collision in a set of 2?

What is the general solution to this, so that I can come up with a number of hash attempts after which the odds fall below some acceptable threshold? E.g. I can say things like "A batch of 49999 hash value creations has a high chance of collision".

+7  A: 

This is a generalization of the Birthday problem.

Pesto
+1  A: 

This is called the Birthday problem. To solve it, think instead about the odds of there being no collisions (call it pnc).

  • pnc(1) = 1
  • pnc(2) = 1 - pc(2)
  • pnc(3) = pnc(2) * pnc(2) * pnc(2)
MarkusQ
+1  A: 

That sounds a lot like the Birthday Paradox to me.

You should be able to just substitute the set of possible birthdays (365) with the possible hashes (50000) and run the same calculations they present there.

If you modify the python script presented in the article for your values:

 def bp(n, d):
    v = 1.0
    for i in range(n):

      v = v * (1 - float(i)/d)
    return 1 - v

 print bp(2, 50000)

You end up with the odds of collision on two numbers of 0.00002. Around 265 samples, you have around a 50% chance of having a collision.

Dana
awesome, thanks.
izb
+2  A: 

First calculate the probability that there is not a collision:

hashes_picked = 100
single_collision_odds = 50000

# safe_combinations is number of ways to pick hashes that don't overlap
safe_combinations = factorial(single_collision_odds) / factorial(single_collision_odds - hashes_picked)

# all_combinations is total number of ways to pick hashes
all_combinations = single_collision_odds ** hashes_picked   

collision_chance = (all_combinations - safe_combinations) / all_combinations
recursive