Can two different strings generate the same MD5 hash code?

views:

1351

answers:

+4 Q:

Can two different strings generate the same MD5 hash code?

For each of our binary assets we generate a MD5 hash. This is used to check whether a certain binary asset is already in our application. But is it possible that two different binary assets generate the same MD5 hast. So is it possible that two different strings generate the same MD5 hash?

+6 A:

Yes, it is possible. This is in fact a Birthday problem. However the probability of two randomly chosen strings having the same MD5 hash is very low.

See this and this questions for examples.

sharptooth 2009-11-18 13:38:27

What probability? That of collision? No, that would be 1, i.e. very high. ;-)

Konrad Rudolph 2009-11-18 13:40:25

Well, true. There surely exist two strings with the same MD5 hash.

sharptooth 2009-11-18 13:41:53

I've known this as the pigeon-hole problem.

Daniel A. White 2009-11-18 13:51:54

http://en.wikipedia.org/wiki/Pigeonhole_principle

Daniel A. White 2009-11-18 13:52:29

the birthday problem just concerns the liklyhood of a collision. for proof there must be one you want the pidgeon hole principle

jk 2009-11-18 13:56:35

+1 A:

Yes, it is! Collision will be a possibility (although, the risk is very small). If not, you would have a pretty effective compression method!

EDIT: As Konrad Rudolph says: A potentially unlimited set of input converted to a finite set of output (32 hex chars) will results in an endless number of collisions.

jensgram 2009-11-18 13:38:51

+11 A:

MD5 is a hash function – so yes, two different strings can absolutely generate colliding MD5 codes.

In particular, note that MD5 codes have a fixed length so the possible number of MD5 codes is limited. The number of strings (of any length), however, is definitely unlimitd so it logically follows that there must be collisions.

Konrad Rudolph 2009-11-18 13:38:51

+1 A:

Yes I have read that it is possible but I fail to understand the reasonings behind it due to laziness. Here is a link to a demo explaining how collisions can take place.

Peter D 2009-11-18 13:40:09

+3 A:

Yes, it is possible. It is called a Hash collision.

Having said that, algorithms such as MD5 are designed to minimize the probability of a collision.

The Wikipedia entry on MD5 explains some vulnerabilities in MD5, which you should be aware of.

iWerner 2009-11-18 13:40:14

+5 A:

Yes, of course: MD5 hashes have a finite length, but there are an infinite number of possible character strings that can be MD5-hashed.

Tony Andrews 2009-11-18 13:41:43

+1 A:

As other people have said, yes, there can be collisions between two different inputs. However, in your use case, I don't see that being a problem. I highly doubt that you will run into collisions - I've used MD5 for fingerprinting hundreds of thousands of image files of a number of image (JPG, bitmap, PNG, raw) formats at a previous job and I didn't have a collision.

However, if you are trying to fingerprint some kind of data, perhaps you could use two hash algorithms - the odds of one input resulting in the same output of two different algorithms is near impossible.

Thomas Owens 2009-11-18 13:44:47

+2 A:

Just to be more informative. From a math point of view, Hash functions are not injective.
It means that there is not a 1 to 1 (but one way) relationship between the starting set and the resulting one.

Bijection on wikipedia

EDIT: to be complete injective hash functions exist: it's called Perfect hashing.

Roubachof 2009-11-18 13:45:40

+3 A:

About MD5 collision examples: MD5 Collision Demo and MD5 Collisions, Visualised

Rubens Farias 2009-11-18 13:49:33

+12 A:

For a set of even billions of assets, the chances of random collisions are negligibly small -- nothing that you should worry about. Considering the birthday paradox, given a set of 2^64 (or 18,446,744,073,709,551,616) assets, the probability of a single MD5 collision within this set is 50%. At this scale, you'd probably beat Google in terms of storage capacity.

However, because the MD5 hash function has been broken (it's vulnerable to a collision attack), any determined attacker can produce 2 colliding assets in a matter of seconds worth of CPU power. So if you want to use MD5, make sure that such an attacker would not compromise the security of your application!

Also, consider the ramifications if an attacker could forge a collision to an existing asset in your database. While there are no such known attacks (preimage attacks) against MD5 yet, it will likely be possible soon by extending the research on collision attacks.

If these turn out to be a problem, have a look at the SHA-2 series of hash functions (SHA-256, SHA-384 and SHA-512). There is no downside to using these -- besides the longer hash output.

intgr 2009-11-18 13:51:33

'Days' is a massive overstatement at this point, as I understand it.

Nick Johnson 2009-11-18 15:41:09

True, I updated my post. The 2004 random collision attack is very fast indeed. The 2007 MD5 **prefix** collision attack can take days -- but is generally much more useful to an attacker

intgr 2009-11-18 15:53:01

See Rubens' answer for a working example that will generate a collision between two different executables in a matter of hours. :)

Nick Johnson 2009-11-18 16:08:02

Thanks, you are right.

intgr 2009-11-18 16:23:37

ansaurus

tags:

views:

answers:

Can two different strings generate the same MD5 hash code?

related questions