views:

5343

answers:

12

I have lots of unrelated named things that I'd like to do quick searches against. An "aardvark" is always an "aardvark" everywhere, so hashing the string and reusing the integer would work well to speed up comparisons. The entire set of names is unknown (and changes over time). What is a fast string hashing algorithm that will generate small (32 or 16) bit values and have a low collision rate?

I'd like to see an optimized implementation specific to C/C++.

+4  A: 

Have a look at GNU gperf.

Rob Wells
Yay for perfect hash generators!
Chris Jester-Young
Perfect hashing is NOT appropriate for this application, since the set of names is unknown and changes. Therefore, gperf won't work for this.
TimB
A: 

CRC-32. There is about a trillion links on google for it.

1800 INFORMATION
CRCs are designed for error detection and correction. Their distribution characteristics are typically not very good.
Nick Johnson
Arachnid has obviously never tried CRC32 as hashes. They work well.
Nils Pipenbrinck
"CRC32 was never intended for hash table use. There is really no good reason to use it for this purpose." cf. http://home.comcast.net/~bretm/hash/8.html
obecalp
+8  A: 

One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs.

Nick Johnson
If you're going to use FNV, stick to FNV-1a, since it has acceptable results on the avalanche test (see http://home.comcast.net/~bretm/hash/6.html). Or just use MurmurHash2, which is better in both speed and distribution (http://murmurhash.googlepages.com/).
Steven Sudit
+8  A: 

For a fixed string-set use gperf.

If your string-set changes you have to pick one hash function. That topic has been discussed before:

http://stackoverflow.com/questions/98153/

Nils Pipenbrinck
+9  A: 

Murmur Hash is pretty nice.

yrp
Yes, this is the current leading general purpose hash function for hash tables. It's non-crypto, of course, with a pair of obvious differential.
obecalp
+3  A: 

Another solution that could be even better depending on your use-case is interned strings. This is how symbols work e.g. in Lisp.

An interned string is a string object whose value is the address of the actual string bytes. So you create an interned string object by checking in a global table: if the string is in there, you initialize the interned string to the address of that string. If not, you insert it, and then initialize your interned string.

This means that two interned strings built from the same string will have the same value, which is an address. So if N is the number of interned strings in your system, the characteristics are:

  • Slow construction (needs lookup and possibly memory allocation)
  • Requires global data and synchronization in the case of concurrent threads
  • Compare is O(1), because you're comparing addresses, not actual string bytes (this means sorting works well, but it won't be an alphabetic sort).

Cheers,

Carl

Carl Seleborg
A: 

The Hsieh hash function is pretty good, and has some benchmarks/comparisons, as a general hash function in C. Depending on what you want (it's not completely obvious) you might want to consider something like cdb instead.

James Antill
A: 

There is some good discussion in this previous question

And a nice overview of how to pick hash functions, as well as statistics about the distribution of several common ones here

AShelly
A: 

Why don't you just use Boost libraries? Their hashing function is simple to use and most of the stuff in Boost will soon be part of the C++ standard. Some of it already is.

Boost hash is as easy as

#include <boost/functional/hash.hpp>

int main()
{
    boost::hash<std::string> string_hash;

    std::size_t h = string_hash("Hash me");
}

You can find boost at boost.org

Bernard
Both STL and boost tr1 has extremely weak hash function for strings.
obecalp
A: 

Bob Jenkins has many hash functions available, all of which are fast and have low collision rates.

sixlettervariables
The hashes are quite solid, and technically interesting, but not necessarily fast. Consider that One-at-a-Time hash processes input byte by byte, where other hashes take 4 or even 8 bytes at a time. The speed differnece is substantial!
Steven Sudit
Bob's hashes are very fast: http://www.azillionmonkeys.com/qed/hash.html
sixlettervariables
A: 

You can see what .NET uses on the String.GetHashCode() method using Reflector.

I would hazard a guess that Microsoft spent considerable time optimising this. They have printed in all the MSDN documentation too that it is subject to change all the time. So clearly it is on their "performance tweaking radar" ;-)

Would be pretty trivial to port to C++ too I would have thought.

NathanE
A: 

There's also a nice article at eternallyconfuzzled.com.

Jenkins' One-at-a-Time hash for strings should look something like this:

#include <stdint.h>

uint32_t hash_string(const char * s)
{
    uint32_t hash = 0;

    for(; *s; ++s)
    {
     hash += *s;
     hash += (hash << 10);
     hash ^= (hash >> 6);
    }

    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}
Christoph