views: 59
answers: 4

I have a list of n strings (names of people) that I want to store in a hash table or similar structure. I know the exact value of n, so I want to use that fact to get O(1) lookups, which would be impossible if I had to use a linked list to store my hash nodes. My first reaction was to use the djb hash, which essentially does this:

h = 5381;   /* djb2's customary starting value */
for ( i = 0; i < len; i++ )
    h = 33 * h + p[i];

To compress the resulting h into the range [0, n), I would like to simply do h % n, but I suspect that this will lead to a much higher probability of collisions, in a way that would essentially render my hash useless.
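In full, what I have in mind is something like this (identifier names are just placeholders):

unsigned long hash_to_slot(const char *p, unsigned long n)
{
    /* djb2 over the whole string, then modulo to land in [0, n) */
    unsigned long h = 5381;
    while (*p)
        h = 33 * h + (unsigned char)*p++;
    return h % n;
}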

My question, then, is: how can I hash either the string or the resulting hash so that the n elements are spread relatively uniformly over [0, n)?

+3  A: 

It's not enough to know n. Allocation of an item to a bucket is a function of the item itself, so if you want a perfect hash function (one item per bucket), you need to know the data.

In any case, if you're limiting the number of elements to a known n, lookup is already technically O(1). The upper bound is based on the constant n, and that would be true even for a non-hash solution.

Your best bet is probably to just use the hash function you have, with each bucket being a linked list of the colliding items. Even if the hash is less than perfect, you're still greatly reducing the time taken.
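Something along these lines, say (a rough sketch only; NBUCKETS, the entry layout, and the absence of error checking are all just illustrative):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64                /* illustrative; roughly n in practice */

struct entry {
    char *name;
    struct entry *next;            /* chain of colliding items */
};

static struct entry *buckets[NBUCKETS];

static unsigned long djb_hash(const char *p)
{
    unsigned long h = 5381;
    while (*p)
        h = 33 * h + (unsigned char)*p++;
    return h;
}

static void table_insert(const char *name)
{
    unsigned long b = djb_hash(name) % NBUCKETS;
    struct entry *e = malloc(sizeof *e);
    e->name = strdup(name);
    e->next = buckets[b];          /* push onto the front of the chain */
    buckets[b] = e;
}

static struct entry *table_find(const char *name)
{
    struct entry *e;
    for (e = buckets[djb_hash(name) % NBUCKETS]; e; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}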

Only if the hash is totally imperfect (all n elements placed in one bucket) will it be as bad as a normal linked list.

If you don't know the data in advance, a perfect hash is not possible. Unless, of course, you use h itself as the hash key rather than h % n, but that's going to take an awful lot of storage :-)

My advice is to go the good-enough-hash-with-linked-list route. I don't doubt that you could make a better hash function based on the relative frequencies of letters in people's names across the population, but even the hash you have (which is ideal when all letters occur with the same frequency) should be adequate.

And, anyway, if you start relying on frequencies and you get an influx of people from those countries that don't seem to use vowels (a la Bosnia [a]), you'll end up with more collisions.

But keep in mind that it really depends on the n that you're using.

If n is small enough, you could even get away with a sequential search of an unsorted array. I'm assuming your n is large enough here that you've already established that a sequential search (or a balanced binary tree) won't give you enough performance.

A case in point: we have some code which searches through problem dockets looking for names of people that left comments (so we can establish the last member of our team who responded). There are only ever about ten or so members on our team, so we just use a sequential search for them; the performance improvement from a faster data structure was deemed not worth the trouble.
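The code for that is about as simple as it gets (a trivial sketch; `names` is the hypothetical unsorted array):

#include <string.h>

/* Return the index of key in names[0..n-1], or -1 if it's absent. */
static int find_name(const char *names[], int n, const char *key)
{
    int i;
    for (i = 0; i < n; i++)
        if (strcmp(names[i], key) == 0)
            return i;
    return -1;
}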


[a] No offence intended. I just remember a humorous article from long ago about Clinton authorising the airlifting of vowels to Bosnia. I'm sure there are other countries with a similar "problem".

paxdiablo
+1, good point; big-O complexity doesn't come into play here. This is a case of trying to reduce the magnitude of your constant-time lookup.
bshields
@pax but the O(1)-ness of my function is ex-post :( I guess `h%n` is sufficient then.
piggles
You can always generate the perfect hash function at runtime.
R..
+3  A: 

What you're after is called a Perfect Hash. It's a hash function where all the keys are known ahead of time, designed so that there are no collisions.

The gperf program generates C code for perfect hashes.
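A minimal input file (the names here are just for illustration) is little more than a list of the keys:

%%
alice
bob
carol

Running `gperf` over that file writes C source to standard output; by default the generated functions are named `hash` and `in_word_set`, and command-line options let you rename them.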

caf
+1, I didn't know that was a standard GNU utility!
Jim Lewis
I'll give you one for that, @caf. Seeing what gperf outputs for a specific case has set me well on the way to solving a current, particularly thorny, performance problem we're having at work. That's a good tool.
paxdiablo
A: 

It sounds like you're looking for an implementation of a perfect hash function, or perhaps even a minimal perfect hash function. According to the Wikipedia page, CMPH might fit your needs. Disclaimer: I've never used it.

Jim Lewis
A: 

The optimal algorithm for mapping n strings to integers 1-n is to build a DFA where the terminating states are the integers 1-n. (I'm sure someone here will step up with a fancy name for this...but in the end it's all DFA.) Size/speed tradeoff can be adjusted by varying your alphabet size (operating on bytes, half-bytes, or even bits).
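A rough sketch of the byte-alphabet version (really just a trie; the state layout and function names are illustrative, and allocation failures are ignored):

#include <stdlib.h>

struct state {
    struct state *next[256];   /* one transition per input byte */
    int id;                    /* 1..n in terminating states, 0 elsewhere */
};

static struct state *state_new(void)
{
    return calloc(1, sizeof(struct state));
}

/* Extend the DFA so that following `name` from the root ends in a
   state carrying `id`. */
static void dfa_add(struct state *root, const char *name, int id)
{
    for (; *name; name++) {
        unsigned char c = (unsigned char)*name;
        if (!root->next[c])
            root->next[c] = state_new();
        root = root->next[c];
    }
    root->id = id;
}

/* Run the DFA; returns the id, or 0 if the string was never added. */
static int dfa_lookup(const struct state *root, const char *name)
{
    for (; root && *name; name++)
        root = root->next[(unsigned char)*name];
    return root ? root->id : 0;
}

Lookup time is proportional to the length of the string and independent of n; the next[256] arrays are where the size/speed tradeoff comes in (half-bytes would mean next[16] and two transitions per byte).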

R..