views: 233
answers: 3

I was reading about this person's interview "at a well-known search company".

http://asserttrue.blogspot.com/2009/05/one-of-toughest-job-interview-questions.html

He was asked a question which led to him implementing a hash table. He said the following:

HASH = INITIAL_VALUE;
FOR EACH ( CHAR IN WORD ) {
    HASH *= MAGIC_NUMBER;
    HASH ^= CHAR;
    HASH %= BOUNDS;
}
RETURN HASH;

I explained that the hash table array length should be prime, and that the BOUNDS number should be less than the table length but coprime to it.

Why should the BOUNDS number be less than the number of buckets? What does being coprime to the table length achieve? Isn't the table length supposed to be coprime to BOUNDS?
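
Here is a runnable version of how I read that pseudocode (Python; the article never gives INITIAL_VALUE, MAGIC_NUMBER, or BOUNDS, so the values below are just placeholder constants):

INITIAL_VALUE = 5381   # placeholder seed
MAGIC_NUMBER = 33      # placeholder multiplier
BOUNDS = 1000003       # placeholder modulus

def word_hash(word):
    h = INITIAL_VALUE
    for ch in word:
        h *= MAGIC_NUMBER
        h ^= ord(ch)
        h %= BOUNDS
    return h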

+4  A: 

I would hazard that he is completely wrong. BOUNDS should be the number of buckets or the last few buckets are going to be underused.

Further, bounding the output to the number of buckets should happen OUTSIDE the hash function. That is an implementation detail of the particular hash table. You might have one very large table with lots of buckets and another with few; both should share the same string->hash function.
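
Something like this sketch (Python; the names are mine, and the point is only that the modulo lives in the table, not in the hash):

def string_hash(word):
    # Shared string->hash function: full-range output, with no
    # knowledge of any particular table's bucket count.
    h = 5381  # arbitrary seed
    for ch in word:
        h = (h * 33) ^ ord(ch)
    return h

class HashTable:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def bucket_for(self, word):
        # Reducing the hash to a bucket index is this table's own
        # implementation detail; a bigger table just uses a bigger modulus.
        return self.buckets[string_hash(word) % len(self.buckets)]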

Further, the page you linked to is quite interesting. I would have implemented his hash table with something like 10,000 buckets. (For those who haven't read it: the article suggests ~4,000,000,000 buckets to store 1,000,000 or so possible words.) For collisions, each bucket holds a vector of word structures, each containing a count, a plaintext string, and a hash (unique within the bucket). This would use far less memory and work better with modern caches, since your working set would be much smaller.
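
A minimal sketch of that layout (Python; the field layout and the use of the built-in hash() are my stand-ins, since the article only specifies a count, a plaintext string, and a per-bucket hash):

NUM_BUCKETS = 10000

# Each bucket is a list of [full_hash, word, count] entries.
buckets = [[] for _ in range(NUM_BUCKETS)]

def add_word(word):
    h = hash(word)  # any full-range string hash would do here
    bucket = buckets[h % NUM_BUCKETS]
    for entry in bucket:
        # Comparing the stored hash first skips most string compares.
        if entry[0] == h and entry[1] == word:
            entry[2] += 1
            return
    bucket.append([h, word, 1])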

To reduce memory usage further, you could experiment with culling words during the input phase that, based on their current counts, look unlikely to make the top 100,000.
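
Roughly like this (Python; the word -> count mapping and the threshold policy are invented here, and choosing a threshold that never drops a future top-100,000 word is the hard part):

def cull(counts, threshold):
    # counts: dict mapping word -> running count.
    # Drop any word whose count is still below the threshold.
    doomed = [w for w, c in counts.items() if c < threshold]
    for w in doomed:
        del counts[w]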

Tom Leys
Thanks for the input Tom. I had a feeling that he was wrong, but I had to ask StackOverflow to be sure it wasn't me who was lacking in knowledge.
Unknown
"BOUNDS should be the number of buckets or the last few buckets are going to be underused", do you think maybe this is some kind of special trick when the hash table needs to be resized?
Unknown
I completely agree that %BOUNDS is entirely out of place. The hash of a given input should be *independent* of what that hash is being used for. You can use it as a key into a table, you can tie it in a bow, you can pipe it to /dev/null. The hash function should be blissfully ignorant.
leo grrr
@Unknown - No, it is a limitation. The hash function is putting an upper limit on the size of your map. Furthermore, if the hash bounds are not a multiple of your map size, the first portion (hash bounds % map size) of your map's buckets will be overused. The wider the range of the hash function's output, the better it is in the general case. Equal distribution of hash outputs over that range is even more important. However, I am no expert in this area.
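
A toy demonstration of the skew (Python; the numbers are made up to keep it readable):

HASH_BOUNDS = 10  # hash outputs uniform over [0, 10)
TABLE_SIZE = 4    # 10 % 4 == 2, so buckets 0 and 1 are overused

counts = [0] * TABLE_SIZE
for h in range(HASH_BOUNDS):
    counts[h % TABLE_SIZE] += 1
print(counts)  # [3, 3, 2, 2]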
Tom Leys
A: 

I once interviewed for a job at a well-known search company. I got the exact same question, and I tried to tackle it with a hash table.

One thing I learnt from that interview was that at a well-known search company, you do not propose hashes as solutions. You can use any tree-like structure you like, but it should be an ordered structure, not a hash table.

Can you explain more?
Unknown
A: 

A simple explicit suffix tree would use at worst maybe 500 KB of memory (with a moderately efficient implementation, 4-byte character encodings, and relatively long English words with minimal overlap) to do the same thing.
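
To give the flavor (a Python sketch of my own; this is a word-count trie, i.e. a prefix tree rather than a true suffix tree): shared prefixes are stored once, which is where the memory savings come from.

class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.count = 0      # number of words ending at this node

root = Node()

def add_word(word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.count += 1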

I think the guy in the article outsmarted himself.

patros