views:

236

answers:

5

I understand that according to pigeonhole principle, if number of items is greater than number of containers, then at least one container will have more than one item. Does it matter which container will it be? How does this apply to MD5, SHA1, SHA2 hashes?

+2  A: 

The point of a hash function is to randomly distribute items into containers. For any good hash function, it doesn't/shouldn't "matter" which container is which as they must be indistinguishable.

This does not apply to "perfect hash" implementations which attempt to do better than random distribution — unlike the algorithms you mentioned.

As Michael mentioned, collisions happen LONG before there are as many items as slots. You must have graceful collision handling (or a perfect hash) if you want to handle the birthday paradox.

Potatoswatter
+4  A: 

If you have more items to hash than you have slots, then you'll have hash collisions. But if you have a poor hashing algorithm, then you'll see collisions even when the items / slots ratio is very small. A good hashing algorithm (including most of the ones you'll see in the wild) will attempt to spread the resulting hashes over the entire output space as evenly as possible, and thus minimize collisions.

Note that a hash collision is not the end of the world. When used in a hash table, for instance, it just means that more than one item is stored in a slot, and the table code will have to traverse a little bit more to find or add the target item, increasing lookup time slightly.

You'll see people refer to MD5 as a "broken" hashing algorithm, when in reality, it's just a poor one to use as a cryptographic hash. It'll be better than one you build yourself.

Michael Petrotta
MD5 is broken, because it is not cryptographically secure; the various SHA algorithms should be used in lieu of MD5. That said, MD5 is good enough if all you're using it for is as a checksum of a downloaded file.
Michael Aaron Safyan
most of the time md5 is not used to be cryptographically secure. It is a very fast hash that is more than good enough in most cases. Its not like hes suggesting using it for passwords ... or are you suggesting implementing a really secure and deliberatley slow blowfish algorithm for use in a hashing algorithm to speed up lookups in a situation where the data rapidly changes? At least it will be cryptographically secure. MD5 still has a valid purpose and is very good at it.
John Nicholas
@MrTortoise: yep, that's exactly right. MD5 is fine for non-cryptographic usage. In case it's not clear to others, **don't** use MD5 for security-sensitive (cryptographic) situations.
Michael Petrotta
A: 

I think which application you're using the hash function for is an important distinction. Frequent collision in hashing containers, for example, can degrade performance. Frequent collision in cryptography will have far more devastating consequences (see: cryptographic hash function on Wikipedia).

Collision happens relatively easily even with "decent" hashing algorithm. For example, in Java,

String s = new String(new char[size]);

always hashes to 0. That is, all strings containing only \0 hash to 0 in Java.


As for "does it matter which container will it be?", again it depends on the application. You can design hash functions that would hash "similar" objects to nearby values. This is useful when you want to search for similar objects, for example. Just hash them all and see where they fall. In this case, collisions or near-collisions are desirable, because it groups objects that are similar.

In other applications, you want even the slightest change in the object to result in an entirely different hash value. This is the case in cryptography, for example, where you want to be as certain as possible that something has not been modified. It is far more difficult to find different objects that hash to the same value in this case.

polygenelubricants
+6  A: 

No it doesn't matter which container it is, and in fact this is not that important to cryptographic hashes; much more important is the birthday paradox, which says that you only need to hash sqrt(numberNeededByPidgeonHolePrincipal) values, on average, before finding a collision.

Thus, the hash needs to be large enough that the square-root of the search space is too large to brute-force. The square-root-of-search-space for SHA1 is 280, and as of July 2010, no two values have ever been found with the same SHA1-hash (though I predict that will happen within the next year or two..); same with SHA2, a family of hashes which all have an even larger search-space. MD5 has been broken for a while though.

BlueRaja - Danny Pflughoeft
A: 

Depending on your application, cryptographic hashes like MDA, SHA1/2 etc. may not be the ideal choice, precisely because they appear as if entirely random, thus giving you collisions as prediced by the birthday paradox. Traditionally, one reason for using simple hashes based on the remainder operation is that keys were expected to be serial numbers or similar, so that a remainder operation would sustain fewer collisions than expected at random. E.g. if the keys are the integers are 1..1000 you might have no collisions at all in a container of size 1009 if your hash function is the key mod 1009. People would sometimes hand-tune systems by carefully picking container size and hash function to achieve an even split.

Of course, if you have to worry about people maliciously choosing keys that will cause you difficulty, or an upstream system sending you very biassed keys (because e.g. it has its own hash table and decides to process all keys that hash to X at once). you may wish to use a hash based on a keyed cryptographic hash function to defend against this.

mcdowella