I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high-performance lookups. What are some good examples of fast and efficient hashing algorithms that should avoid most collisions?
That always depends on your data-set.
I for one had surprisingly good results by using the CRC32 of the string. It works very well with a wide range of different input sets.
Lots of good CRC32 implementations are easy to find on the net.
Edit: Almost forgot: This page has a nice hash-function shootout with performance numbers and test-data:
http://smallcode.weblogs.us/ <-- further down the page.
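For illustration, here is a minimal bitwise CRC-32 sketch (not from the answer above): the table-free form is slow compared with table-driven or hardware-assisted implementations and is shown only to give the shape of the computation. It assumes unsigned int is 32 bits.

#include <string>

/* Minimal bitwise CRC-32 using the standard reflected polynomial 0xEDB88320.
   A real implementation would normally be table-driven for speed. */
unsigned int crc32_hash(const std::string &s)
{
    unsigned int crc = 0xFFFFFFFFu;
    for (std::string::size_type i = 0; i < s.size(); ++i)
    {
        crc ^= static_cast<unsigned char>(s[i]);
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}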
Boost has a boost::hash library which provides basic hash functions for most common types.
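A minimal usage sketch, assuming Boost is available (boost::hash lives in <boost/functional/hash.hpp> and is a function object, so it can be plugged straight into a hash container as the hasher):

#include <string>
#include <boost/functional/hash.hpp>

// Hash a std::string with Boost's generic hasher.
std::size_t hash_of(const std::string &s)
{
    boost::hash<std::string> hasher;
    return hasher(s);
}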
From some old code of mine:
/* magic numbers from http://www.isthe.com/chongo/tech/comp/fnv/ */
static const size_t InitialFNV = 2166136261U;
static const size_t FNVMultiple = 16777619;

/* Fowler / Noll / Vo (FNV) Hash */
size_t myhash(const string &s)
{
    size_t hash = InitialFNV;
    for (size_t i = 0; i < s.length(); i++)
    {
        hash = hash ^ s[i];        /* xor the low 8 bits */
        hash = hash * FNVMultiple; /* multiply by the magic number */
    }
    return hash;
}
It's fast. Really freaking fast.
One classic suggestion for a string hash is to step through the characters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (letting the hash value overflow).
template <typename T> struct myhash;

template <> struct myhash<string>
{
    size_t operator()(const string &to_hash) const
    {
        const char *in = to_hash.c_str();
        size_t out = 0;
        while ('\0' != *in)
        {
            out *= 53; // just a prime number
            out += *in;
            ++in;
        }
        return out;
    }
};
hash_map<string, int, myhash<string> > my_hash_map;
It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can go faster by hashing only those characters.
If the hash value gets calculated too often, you might try caching it by creating a subclass of basic_string that remembers its own hash value (sketched below). hash_map should be doing that internally, though.
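A rough sketch of that idea, using a plain wrapper class (hypothetically named CachedHashString) rather than an actual basic_string subclass, and reusing the FNV-style myhash function from the earlier answer:

#include <string>
using std::string;

// Hypothetical wrapper: stores the string together with a lazily
// computed, cached hash value.
class CachedHashString
{
public:
    explicit CachedHashString(const string &s)
        : str_(s), hash_(0), hashed_(false) {}

    size_t hash() const
    {
        if (!hashed_)
        {
            hash_ = myhash(str_);   // any string hash will do here
            hashed_ = true;
        }
        return hash_;
    }

    const string &str() const { return str_; }

private:
    string str_;
    mutable size_t hash_;
    mutable bool hashed_;
};

// Functor so the cached value can be handed to hash_map as its hasher.
struct cached_hasher
{
    size_t operator()(const CachedHashString &s) const { return s.hash(); }
};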
I've used the Jenkins hash to write a Bloom filter library; it has great performance.
Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c
This is what Perl uses for its hashing operation, fwiw.
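As a sketch of how it might drive a Bloom filter (assuming lookup3.c is compiled in and that hashlittle2 has the signature shown, which should be checked against the file), the k bit positions can be derived from two 32-bit hashes with the common h1 + i*h2 double-hashing trick:

#include <stdint.h>
#include <stddef.h>
#include <string>
#include <vector>

// Declared in lookup3.c (http://burtleburtle.net/bob/c/lookup3.c);
// signature assumed here: seeds go in via *pc/*pb, hashes come out.
extern "C" void hashlittle2(const void *key, size_t length,
                            uint32_t *pc, uint32_t *pb);

// Compute k Bloom-filter bit positions for a key over nbits bits.
void bloom_positions(const std::string &key, size_t k, size_t nbits,
                     std::vector<size_t> &out)
{
    uint32_t h1 = 0, h2 = 0;
    hashlittle2(key.data(), key.size(), &h1, &h2);
    out.clear();
    for (size_t i = 0; i < k; ++i)
        out.push_back((h1 + i * h2) % nbits);
}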
If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer (and translation of keywords to tokens) is a common usage of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_map with a simple array or vector.
If you're not hashing a fixed set of words, then obviously this doesn't apply.
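For a rough idea of the lexer usage: the struct layout, token values, and the in_word_set entry point below are assumptions about a typical gperf -t setup, and the exact generated signature varies by gperf version and options, so treat this only as a sketch.

#include <cstring>

// User-declared struct that gperf fills its keyword table with.
struct keyword { const char *name; int token; };

// Declaration of the gperf-generated lookup (assumed shape).
const keyword *in_word_set(const char *str, unsigned int len);

// Map a word to its token, or -1 if it is not a keyword.
int keyword_token(const char *word)
{
    const keyword *kw = in_word_set(word, std::strlen(word));
    return kw ? kw->token : -1;
}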
If your strings are on average longer than a single cache line, but their length plus a short prefix is reasonably unique, consider hashing just the length and the first 8/16 characters, as sketched below. (The length is contained in the std::string object itself and is therefore cheap to read.)
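A minimal sketch of that idea (the 16-character prefix and the FNV-style mixing constants are arbitrary choices for illustration, not from the answer):

#include <string>
#include <algorithm>

// Hash only the length and a short prefix; cheap for long strings
// whose beginnings plus lengths rarely collide.
size_t prefix_hash(const std::string &s)
{
    const size_t prefix = std::min<size_t>(s.size(), 16);
    size_t hash = 2166136261U ^ s.size();   // fold the length in up front
    for (size_t i = 0; i < prefix; ++i)
    {
        hash ^= static_cast<unsigned char>(s[i]);
        hash *= 16777619;                    // FNV-style multiply
    }
    return hash;
}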
I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.
unsigned int hash(const char *s, unsigned int seed = 0)
{
    unsigned int hash = seed;
    while (*s)
    {
        hash = hash * 101 + *s++;
    }
    return hash;
}