tags:

views:

75

answers:

3

I'm trying to hash a string into an integer for placing it in an array. However I do not know all too much about hashing functions, and that's why my current method is just adding all the ASCII numbers of the characters together and taking it mod the array size.

Are there any simple faster/better methods?

A: 

Jenkins hash function should help you get started.

my current method is just adding all the ASCII numbers of the characters together and taking it mod the array size.

You discard important bit of information which is the position of the character in the string. That is a bad idea, since then strings "AB" and "BA" would have same the same hash value.

Instead of simple addition, keeping it primitive, one can use expression like hash = hash*P1 + str[i]*P2 + P3; where Pi are some prime numbers. That's how I do it if I need a hash function quickly. I often use 7, 5 and 3 as the primes, but the numbers should be obviously adjusted (as well as initial value of hash) so that the result of hash function is usable to your task.

For more information read the corresponding (and rather informative) Wikipedia article.

Dummy00001
Sure, start there, but then keep moving. As the article you linked to shows, it has pretty awful avalanche characteristics. Just look at all the yellow pixels.
Steven Sudit
@Steven: The yellow dots are really few there. And it is definitely better than simple addition. And it is fast and works fine on real world strings. I use it in several places and it definitely improved key distribution.
Dummy00001
I don't deny for a moment that it's better than addition -- I even pointed out the AB/BA problem in a comment to an answer that was deleted. However, there's no reason there should be *any* yellow pixels. Projects like memcachedb use Murmur for a good reason!
Steven Sudit
Don't be silly. There would be always yellow pixels - because one still has to apply modulo to the result of the hash function and that would discard more useful bits of information. And murmur doesn't strike to be much different from Jenkins or FNV. Or you haven't even bothered to check how Murmur works? I think you're being over-pedantic here.
Dummy00001
Oh I'm *definitely* pedantic, but I'm also completely right. It's not radically different, but it's designed to avoid this particular flaw. For a good visual comparison, look at http://sites.google.com/site/murmurhash/avalanche
Steven Sudit
From the one of the [linked articles](http://home.comcast.net/~bretm/hash/7.html): "This hash function by Bob Jenkins should be suitable for general purpose use, either for hash table lookup, basic file fingerprinting, or other non-cryptographic uses. Both the upper and lower bits are uniformly distributed. The hash function achieves avalanche for every bit combination tested."
Dummy00001
At risk of repeating myself, I'll point out that this comment was about the lookup3 algorithm, whereas the avalanche graphic we were discussing was for the one-at-a-time algorithm. In short, Mulvey is not backing you up on this one.
Steven Sudit
+2  A: 

The FNV-1a hash is quick and easy to implement.

Greg Hewgill
It's quick, easy and terrible. Its avalanche characteristics are unacceptable. If you must use an FNV variant, FNV-1a is much, much better in this regard. Or, better yet, don't use FNV at all. There are so many better alternatives that this one is a waste of time.
Steven Sudit
+1. FNV Wikipedia entry beats that one of the Jenkins. And language agnostic.
Dummy00001
@Dummy: Only because they omitted the avalanche results. It turns out that FNV-1a doesn't suck, but FNV does. For that matter, Jenkins' lookup3 is ok, whereas his one-at-a-time sucks.
Steven Sudit
@Greg: I doubt you care about your rep score any more than I do, but if you edit your answer to specify FNV-1a, then I'l be glad to remove the downvote. I still don't like this FNV variant, but at least it's not terrible.
Steven Sudit
@Steven Sudit: That's fair, changed the link.
Greg Hewgill
Done. If I'm reading the chart at http://www.strchr.com/hash_functions right, it looks like FNV-1a is optimized for longer inputs, so it would not do well with short strings.
Steven Sudit
A: 

As Dummy00001 pointed out, this has been asked and answered before. Take a look at http://stackoverflow.com/questions/1359654/best-algorithm-for-hashing-number-values, particularly the suggestion of using MurmurHash.

I'd recommend MurmurHash because:

  1. It's very fast.

  2. Its distribution and avalanche characteristics are excellent for a non-cryptographic hash.

  3. Its worst-case behavior is still pretty good.

I've used it. It doesn't suck.

edit

There was a lot of discussion about how to best port it to Delphi, on https://forums.embarcadero.com/thread.jspa?threadID=13902&tstart=0. The resulting code is available at https://forums.codegear.com/thread.jspa?threadID=14879

Steven Sudit
Any pointers to proof that the hash function doesn't suffer from the avalanche effect you complain elsewhere so much? Murmur uses pretty much the same algorithm, only slightly differently.
Dummy00001
I answered this elsewhere, but I'll repeat it here: http://sites.google.com/site/murmurhash/avalanche
Steven Sudit
@Dummy0001: After you've looked, remove the downvote.
Steven Sudit
Some people might think that, since the source is the canonical site for MurmurHash, it must be suspect. Those people should read more carefully, and follow the link to http://home.comcast.net/~bretm/hash/
Steven Sudit
@Steven: regarding the link - 404 errors when trying to look into at the diagrams at the link. Nor you have provided any link to the sample implementation of advertised algorithm. P.S. the second link (~bretm) shows different avalanche diagram for Jenkins and conclusion is also nice: "This hash function by Bob Jenkins should be suitable for general purpose use, either for hash table lookup, basic file fingerprinting, or other non-cryptographic uses." You are very *trustworthy* source of information ;)
Dummy00001
@Dummy: Pay attention: Mulvey’s diagram was for lookup3, not one-at-a-time.
Steven Sudit
@Steven: Pay attention, I linked to the Wikipedia article which links to several implementation - and plethora of other useful information. Remove down-vote from my answer, re-edit your own answer into something more up-vote-worthy (don't you know how to post links here?), and I would remove my down-vote.
Dummy00001
A more comprehensive survey of hash functions, focusing on collision-avoidance and speed, can be found at http://www.strchr.com/hash_functions.
Steven Sudit
Edit your answer so that it clearly recommends *only* lookup3, and I'll remove the downvote. I will still suggest that lookup3 be avoided, since MurmurHash is at least as good in terms of distribution and avalanche, and yet it's faster.
Steven Sudit
It turns out the 404 errors are due to the fact that the author is in the middle of moving the site to http://tanjent.com/doku.php?id=murmurhash
Steven Sudit