views:

503

answers:

4

What is the best 32bit hash function for relatively short strings?

Strings are tag names that consist of English letters, numbers, spaces and some additional characters (#, $, ., ...). For example: Unit testing, C# 2.0.

I am looking for 'best' as in 'minimal collisions', performance is not important for my goals.

+3  A: 

I'm not sure if it's the best choice, but here is a hash function for strings:

The Practice of Programming (HASH TABLES, pg. 57)

/* hash: compute hash value of string */
unsigned int hash(char *str)
{
   unsigned int h;
   unsigned char *p;

   h = 0;
   for (p = (unsigned char*)str; *p != '\0'; p++)
      h = MULTIPLIER * h + *p;
   return h; // or, h % ARRAY_SIZE;
}

Empirically, the values 31 and 37 have proven to be good choices for the
multiplier in a hash function for ASCII strings.

Nick D
Yep we use this exact hashing function with MULTIPLIER = 37 for strings and paths. Works well for us and I've yet to encounter a collision issue even after 2 years ( of course there's no guarantee we won't though )
zebrabox
This definitely looks simple enough. Any ideas why FNV was created if much simpler approach works?
Andrey Shchekin
@Andrey Shchekin, I use FNV hash when I deal with raw bytes (blobs). Perhaps, the above function yields better results specifically with strings. I'm not sure.
Nick D
@Andrey + Nick D - Main reason we use the above algorithm is for speed. I know that performance wasn't a priority for Andrey so may be not be relevant. I've also used FNV32 but more hashing binary data like Nick D mentioned. Can't really compare like for like though - might be worth trying both out and seeing which one has the lower collision rate
zebrabox
+1  A: 

You might check out murmurhash2. It is fast, also for small strings, and has a good mixing final step so it is even good mixed for very small strings.

Ritsaert Hornstra
+3  A: 

If performance isn't important, simply take a secure hash such as MD5 or SHA1, and truncate its output to 32 bits. This will give you a distribution of hash codes that's indistinguishable from random.

Nick Johnson
md5 is perfect for this scenario
Alex
MD4 (see http://tools.ietf.org/html/rfc1320 ) may be even better, since it is slightly simpler to implement than MD5. Note that neither MD4 nor MD5 is indistinguishable from random (both were "cryptographically broken") but they still are close enough for the purpose at hand.
Thomas Pornin
Do you think it would have less collisions than Nick D's answer? I am somewhat undecided on what to approve/use.
Andrey Shchekin
@Thomas MD5 is broken in the sense that you can create a hash collision - two plaintexts that produce the same hash. That doesn't mean that the output of MD5 is distinguishable from randomness - there's no preimage attack against MD5. Which is easier to implement is kind of irrelevant, too - he'll almost certainly have a pre-made MD5 or SHA1 implementation in his language of choice.
Nick Johnson
@Andrey Secure hashes such as SHA1 and MD5 have the fewest collisions theoretically possible. Faster hashes such as the one Nick suggests are excellent where speed is important, but have more collisions. Your language of choice almost certainly has a pre-made implementation of MD5 or SHA1, too.
Nick Johnson
@Nick: attacks on MD5 are based on a differential path. By applying the input difference on an MD5 input, you have a small but higher-than-random probability of finding the expected difference in the output. This does not lead to a preimage attack, but it makes MD5 distinguishable from a random oracle. In the case of MD4, this was shown to be (academically) exploitable when used in HMAC (where collisions per se are no worry).
Thomas Pornin
Ok, thanks, I'll use it.
Andrey Shchekin
@Thomas I stand corrected. I didn't know that MD4 was broken even for HMACs - interesting result!
Nick Johnson
A: 

If it's rare that users add new tags, then you can use a perfect hash (http://en.wikipedia.org/wiki/Perfect_hash_function) that's recomputed each time a new tag is added. Of course, without knowing the problem you are really trying to solve, it's guesswork to figure out what you might do.

Paul Hankin