views: 233
answers: 3

I was reading about this person's interview "at a well-known search company".

http://asserttrue.blogspot.com/2009/05/one-of-toughest-job-interview-questions.html

He was asked a question which led to him implementing a hash table. He said the following:

HASH = INITIAL_VALUE;
FOR EACH ( CHAR IN WORD ) {
    HASH *= MAGIC_NUMBER;
    HASH ^= CHAR;
    HASH %= BOUNDS;
}
RETURN HASH;

I explained that the hash table array length should be prime, and that the BOUNDS number should be less than the table length but coprime to it.

Why should the BOUNDS number be less than the number of buckets? What does being coprime to the table length achieve? Isn't the table length supposed to be coprime to BOUNDS?
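
Here is a runnable version of how I read that pseudocode (Python; the article never gives INITIAL_VALUE, MAGIC_NUMBER, or BOUNDS, so the values below are just placeholder constants):

INITIAL_VALUE = 5381   # placeholder seed
MAGIC_NUMBER = 33      # placeholder multiplier
BOUNDS = 1000003       # placeholder modulus

def word_hash(word):
    h = INITIAL_VALUE
    for ch in word:
        h *= MAGIC_NUMBER
        h ^= ord(ch)
        h %= BOUNDS
    return h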

+4  A: 

I would hazard that he is completely wrong. BOUNDS should be the number of buckets or the last few buckets are going to be underused.

Further, bounding the output to the number of buckets should happen OUTSIDE the hash function. That is an implementation detail of the particular hash table. You might have one very large table with lots of buckets and another with few; both should share the same string->hash function.
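
Something like this sketch (Python; the names are mine, and the point is only that the modulo lives in the table, not in the hash):

def string_hash(word):
    # Shared string->hash function: full-range output, with no
    # knowledge of any particular table's bucket count.
    h = 5381  # arbitrary seed
    for ch in word:
        h = (h * 33) ^ ord(ch)
    return h

class HashTable:
    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def bucket_for(self, word):
        # Reducing the hash to a bucket index is this table's own
        # implementation detail; a bigger table just uses a bigger modulus.
        return self.buckets[string_hash(word) % len(self.buckets)]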

Further, the page you linked to is quite interesting. I would have implemented his hash table with something like 10,000 buckets. (For those who haven't read it: the article suggests ~4,000,000,000 buckets to store 1,000,000 or so possible words.) For collisions, each bucket holds a vector of word structures, each containing a count, a plaintext string, and a hash (unique within the bucket). This would use far less memory and work better with modern caches, since your working set would be much smaller.
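
A minimal sketch of that layout (Python; the field layout and the use of the built-in hash() are my stand-ins, since the article only specifies a count, a plaintext string, and a per-bucket hash):

NUM_BUCKETS = 10000

# Each bucket is a list of [full_hash, word, count] entries.
buckets = [[] for _ in range(NUM_BUCKETS)]

def add_word(word):
    h = hash(word)  # any full-range string hash would do here
    bucket = buckets[h % NUM_BUCKETS]
    for entry in bucket:
        # Comparing the stored hash first skips most string compares.
        if entry[0] == h and entry[1] == word:
            entry[2] += 1
            return
    bucket.append([h, word, 1])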

To reduce memory usage further, you could experiment with culling words during the input phase that, based on their current counts, look unlikely to make the top 100,000.
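
Roughly like this (Python; the word -> count mapping and the threshold policy are invented here, and choosing a threshold that never drops a future top-100,000 word is the hard part):

def cull(counts, threshold):
    # counts: dict mapping word -> running count.
    # Drop any word whose count is still below the threshold.
    doomed = [w for w, c in counts.items() if c < threshold]
    for w in doomed:
        del counts[w]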

Tom Leys
Thanks for the input Tom. I had a feeling that he was wrong, but I had to ask StackOverflow to be sure it wasn't me who was lacking in knowledge.
Unknown
"BOUNDS should be the number of buckets or the last few buckets are going to be underused", do you think maybe this is some kind of special trick when the hash table needs to be resized?
Unknown
I completely agree that %BOUNDS is entirely out of place. The hash of a given input should be *independent* of what that hash is being used for. You can use it as a key into a table, you can tie it in a bow, you can pipe it to /dev/null. The hash function should be blissfully ignorant.
leo grrr
@Unknown - No, it is a limitation. The hash function is putting an upper limit on the size of your map. Furthermore, if the hash bounds are not a multiple of your map size, the first portion (hash bounds % map size) of your map's buckets will be overused. The wider the range of the hash function's output, the better it is in the general case. Equal distribution of hash outputs over that range is even more important. However, I am no expert in this area.
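
A toy demonstration of the skew (Python; the numbers are made up to keep it readable):

HASH_BOUNDS = 10  # hash outputs uniform over [0, 10)
TABLE_SIZE = 4    # 10 % 4 == 2, so buckets 0 and 1 are overused

counts = [0] * TABLE_SIZE
for h in range(HASH_BOUNDS):
    counts[h % TABLE_SIZE] += 1
print(counts)  # [3, 3, 2, 2]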
Tom Leys
A: 

I once interviewed for a job at a well-known search company. I got the exact same question, and I tried to tackle it with a hash table.

One thing I learnt from that interview was that at a well-known search company, you do not propose hashes as solutions. You can use any tree-like structure you like, but it should be an ordered structure, not a hash table.

Can you explain more?
Unknown
A: 

A simple explicit suffix tree would use at worst maybe 500 KB of memory (with a moderately efficient implementation, 4-byte character encodings, and relatively long English words with minimal overlap) to do the same thing.
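
To give the flavor (a Python sketch of my own; this is a word-count trie, i.e. a prefix tree rather than a true suffix tree): shared prefixes are stored once, which is where the memory savings come from.

class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.count = 0      # number of words ending at this node

root = Node()

def add_word(word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.count += 1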

I think the guy in the article outsmarted himself.

patros