views:

532

answers:

6

This is a combinatorics question with some theory in hashing algorithms required.

Let's say the input can be any random sequence of bytes 30 kB to 5 MB of size (I guess that makes quite a few combinations of input values :))

What is the probability that the first 4 bytes (or first n bytes) of a MD5 hash computed from the byte sequence will be the same for distinct files?

In case this can not be computed specifically for MD5 hash, then what is the probability that any hash function that generates uniformly distributed m-byte hashes will compute hash with collision on first n bytes for given range of inputs?

+3  A: 

In absence of more information about the byte values probability, I would day it is 1 in 2^32.

EDIT. Indeed, 1 in 2^16 if you are taking the hexadecimal characters instead of the pure bytes.

EDIT based on comment:

Can MD5 be considered that uniform that a the computed values are absolutely random?

MD5 hash algorithm is designed so that a small change in the input results in a completely different hash, so I would say that MD5 hash bytes are distributed with equal probability (I would not bet anything on it anyway). Anyway you can apply a post-processing to your hash (for example you can use keyed MD5) to increase its randomness (and to make it more secure, by the way, since plain MD5 hashes have been proved to be insecure).

Konamiman
?? Changing the representation from bytes to hexadecimal characters changes the probabilities. THAT does not compute.
High Performance Mark
Well, it depends on "the first 4 bytes" meaning the first real 4 bytes or the first 4 digits of the hex representation of the hash.
Konamiman
Thanks for clarification. Does your answer (and other that say 1/65536 is the probability) take into account birthday paradox?
Marek
Downvoter, please explain what I said that is incorrect.
Konamiman
@Marek: The birthday paradox applies to collections, not pairs. The chance given here is correct for a pair of files. Since we don't know how many files there are in total, the total chance is unknown.
MSalters
A: 

MD5 hashes are typically hexadecimals, so there are 16 possible values for each byte. Therefore for four bytes there are 16*16*16*16 = 65536 possible combinations, making the probability of a hash collision 1:65536.

Kaivosukeltaja
+1 beat me to it
David Hedlund
They are displayed to humans in hexadecimals, but are calculated as 4 DWORDS thus he is referring to the first DWORD. which is 4 bytes == 2^32
Simeon Pilgrim
Thank you for quick answer. However, does this take into account specific properties of MD5? Can MD5 be considered that uniform that a the computed values are absolutely random? Isn't the collision probability larger specifically for MD5 as compared to ideal hashing function that is absolutely uniform?
Marek
Provided that the input is "absolutely random", it is reasonable to assume that output would be "absolutely random" as well.Hash functions are designed to be uniform, so in practice it is safe to assume uniformity, unless the input is specifically crafted to cause collisions with the given hash function.
VladV
A: 

md5 is hexadecimal, so each character can be any one of 16 alleles. So that would make 16^n

For 4 characters, that makes 65536 different possible combinations.

David Hedlund
+3  A: 

For an ideal hash function, the outputs are evenly distributed, so the chances of two colliding are one in 2^32. The birthday paradox, however, tells us that if we're comparing all pairs of hashes, we should expect to see a collision once we have 2^16 hashes, on average - so don't rely on only 4 bytes on the basis that "I have a lot less than 4 billion values".

MD5 isn't an ideal hash function, as we know, but the weaknesses are somewhat incidental here: finding a collision on 4 bytes is well within the realm of a reasonable brute-force attack, so there's no need to resort to cryptographic weaknesses. If you're only concerned about randomly picked data, you're not going to see a significant statistical deviation from randomness.

Nick Johnson
+1  A: 

If you are interested in the odds of two particular inputs having the same 4-byte hash, then it's just 1/2^32. If you're interested in the odds of two inputs out of a set of X total inputs having the same odds, this stays fairly low until you start approaching 2^16 = 65536 distinct inputs in your set, where it reaches near 50% (this phenomenon is known as the Birthday Paradox).

In general, one of the criteria for a hash function to be cryptographically useful is the uniformity across all of the bits.

Yuliy
+2  A: 

The odds of a collision in a n-bit hash are around 1 in 2^(n/2) due to the birthday paradox - so about 1 in 2^16 in this case. If for some reason you were referring to using 32 bits of the hex encoding, of course that would only be the first 16 actual bits, so then the odds of a collision would be about 1 in 2^8.

Given a specific fixed file, the odds that any other file chosen at random will have the same hash as that file is about 2^n. In terms of cryptographic hashes the difference between these is the first is a collision, the other is a preimage.

At this hash size, the weaknesses in MD5 are pretty irrelevant since the best known attacks on MD5 take roughly 2^32 computations, while one can generate a collision in even an ideally secure 32-bit hash in around 2^16 computations (since by merely choosing random inputs, you have 1 in 2^16 chance of a collision, so after roughly 2^16 random guesses you'll probably have found a colliding pair of inputs).

Jack Lloyd
+1. Good answer, the distinction between finding **a** collision (birthday paradox complexity) and **another file** that hashes to the same digest (second preimage) is important.
Krystian