ansaurus

Question

What is a performant string hashing function that results in a 32 bit integer with low collision rates?

Answer 1

+4 A:

Have a look at GNU gperf.

Rob Wells 2008-09-22 10:06:20

Yay for perfect hash generators!

Chris Jester-Young 2008-09-22 10:08:43

Perfect hashing is NOT appropriate for this application, since the set of names is unknown and changes. Therefore, gperf won't work for this.

TimB 2008-09-24 04:43:21

Answer 2

A:

CRC-32. There are about a trillion links on Google for it.

1800 INFORMATION 2008-09-22 10:06:47

CRCs are designed for error detection and correction. Their distribution characteristics are typically not very good.

Nick Johnson 2008-09-22 10:09:04

Arachnid has obviously never tried CRC32 as a hash. It works well.

Nils Pipenbrinck 2008-09-22 10:11:06

"CRC32 was never intended for hash table use. There is really no good reason to use it for this purpose." cf. http://home.comcast.net/~bretm/hash/8.html

obecalp 2009-02-16 21:24:38
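
For readers weighing the CRC-32 suggestion above, here is a minimal sketch of the checksum itself, assuming the common reflected polynomial 0xEDB88320 (the variant used by zlib and PNG). The name crc32_bitwise is hypothetical, and this bit-at-a-time form favours clarity over speed; production implementations are usually table-driven.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). Hypothetical helper for illustration only. */
uint32_t crc32_bitwise(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; ++i) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; ++bit) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }

    return crc ^ 0xFFFFFFFFu;
}
```

Calling crc32_bitwise("123456789", 9) should give the standard check value 0xCBF43926, which is a quick way to confirm the variant matches other CRC-32 tools.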

Answer 3

+8 A:

One of the FNV variants should meet your requirements. They're fast, and produce fairly evenly distributed outputs.

Nick Johnson 2008-09-22 10:08:32

If you're going to use FNV, stick to FNV-1a, since it has acceptable results on the avalanche test (see http://home.comcast.net/~bretm/hash/6.html). Or just use MurmurHash2, which is better in both speed and distribution (http://murmurhash.googlepages.com/).

Steven Sudit 2009-07-10 03:38:27
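
A minimal sketch of 32-bit FNV-1a as recommended above, assuming a NUL-terminated input string. The function name fnv1a_32 is hypothetical; the offset basis and prime are the published 32-bit FNV constants.

```c
#include <stdint.h>

/* 32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
 * (Plain FNV-1 does the multiply first and the XOR second.) */
uint32_t fnv1a_32(const char *s)
{
    uint32_t hash = 2166136261u;        /* FNV offset basis */

    for (; *s; ++s) {
        hash ^= (unsigned char)*s;      /* avoid sign extension of char */
        hash *= 16777619u;              /* FNV prime */
    }

    return hash;
}
```

For a hash table with a power-of-two bucket count, the result would typically be masked, for example fnv1a_32(name) & (buckets - 1).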

Answer 4

+8 A:

For a fixed string-set use gperf.

If your string-set changes you have to pick one hash function. That topic has been discussed before:

http://stackoverflow.com/questions/98153/

Nils Pipenbrinck 2008-09-22 10:13:21

Answer 5

+9 A:

Murmur Hash is pretty nice.

yrp 2008-09-22 10:17:20

Yes, this is the current leading general purpose hash function for hash tables. It's non-crypto, of course, with a pair of obvious differentials.

obecalp 2009-02-16 19:21:18

Answer 6

+3 A:

Another solution that could be even better depending on your use-case is interned strings. This is how symbols work e.g. in Lisp.

An interned string is a string object whose value is the address of the actual string bytes. So you create an interned string object by checking in a global table: if the string is in there, you initialize the interned string to the address of that string. If not, you insert it, and then initialize your interned string.

This means that two interned strings built from the same string will have the same value, which is an address. So if N is the number of interned strings in your system, the characteristics are:

Slow construction (needs lookup and possibly memory allocation)
Requires global data and synchronization in the case of concurrent threads
Compare is O(1), because you're comparing addresses, not actual string bytes (this means sorting works well, but it won't be an alphabetic sort).

Cheers,

Carl

Carl Seleborg 2008-09-22 11:02:46
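
The interning scheme described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the names intern, intern_entry and INTERN_BUCKETS are hypothetical, the table is a fixed-size chained hash table, entries are never freed, and there is no locking, so it is not safe for concurrent threads as written.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define INTERN_BUCKETS 1024u            /* hypothetical fixed bucket count */

struct intern_entry {
    struct intern_entry *next;          /* next entry in the same bucket */
    char text[];                        /* the stored string bytes (C99) */
};

/* The global table mentioned in the answer. */
static struct intern_entry *intern_table[INTERN_BUCKETS];

/* Any string hash would do for picking a bucket; FNV-1a is used here. */
static uint32_t intern_hash(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s; ++s) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Returns the canonical address for s, inserting a copy on first use.
 * Returns NULL only if memory allocation fails. */
const char *intern(const char *s)
{
    struct intern_entry **bucket =
        &intern_table[intern_hash(s) % INTERN_BUCKETS];

    /* Slow part: look the string up by content. */
    for (struct intern_entry *e = *bucket; e != NULL; e = e->next) {
        if (strcmp(e->text, s) == 0)
            return e->text;
    }

    /* Not present yet: copy it into the table. */
    size_t len = strlen(s);
    struct intern_entry *e = malloc(sizeof *e + len + 1);
    if (e == NULL)
        return NULL;
    memcpy(e->text, s, len + 1);
    e->next = *bucket;
    *bucket = e;
    return e->text;
}
```

Once two strings have been interned, equality is a plain pointer comparison: intern(a) == intern(b) holds exactly when a and b have the same contents.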

Answer 7

A:

The Hsieh hash function is pretty good as a general hash function in C, and it comes with some benchmarks/comparisons. Depending on what you want (it's not completely obvious) you might want to consider something like cdb instead.

James Antill 2008-09-24 04:13:00

Answer 8

A:

There is some good discussion in this previous question

And a nice overview of how to pick hash functions, as well as statistics about the distribution of several common ones here

AShelly 2008-12-09 21:29:21

Answer 9

A:

Why don't you just use Boost libraries? Their hashing function is simple to use and most of the stuff in Boost will soon be part of the C++ standard. Some of it already is.

Boost hash is as easy as

#include <boost/functional/hash.hpp>
#include <string>

int main()
{
    boost::hash<std::string> string_hash;

    // The literal is converted to std::string before hashing.
    std::size_t h = string_hash("Hash me");
}

You can find boost at boost.org

Bernard 2008-12-16 21:11:27

Both the STL and Boost TR1 have extremely weak hash functions for strings.

obecalp 2009-02-16 19:19:32

Answer 10

A:

Bob Jenkins has many hash functions available, all of which are fast and have low collision rates.

sixlettervariables 2008-12-16 21:30:58

The hashes are quite solid, and technically interesting, but not necessarily fast. Consider that the One-at-a-Time hash processes input byte by byte, where other hashes take 4 or even 8 bytes at a time. The speed difference is substantial!

Steven Sudit 2009-07-23 15:23:38

Bob's hashes are very fast: http://www.azillionmonkeys.com/qed/hash.html

sixlettervariables 2009-07-23 20:20:06

Answer 11

A:

You can see what .NET uses on the String.GetHashCode() method using Reflector.

I would hazard a guess that Microsoft spent considerable time optimising this. The MSDN documentation also warns that the implementation is subject to change. So clearly it is on their "performance tweaking radar" ;-)

It would be pretty trivial to port to C++ too, I would have thought.

NathanE 2008-12-16 21:34:14

Answer 12

A:

There's also a nice article at eternallyconfuzzled.com.

Jenkins' One-at-a-Time hash for strings should look something like this:

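#include <stdint.h>

/* Jenkins' One-at-a-Time hash of a NUL-terminated string. */
uint32_t hash_string(const char * s)
{
    uint32_t hash = 0;

    for(; *s; ++s)
    {
        /* Cast so bytes >= 0x80 are not sign-extended where char is signed. */
        hash += (unsigned char)*s;
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }

    /* Final mixing of the accumulated state. */
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);

    return hash;
}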

Christoph 2008-12-16 22:25:09
