views:

109

answers:

2

I've been reading a lot about Hash Tables and how to implement on in C and I think I have almost all the concepts in my head so I can start to code my own, I just have a couple of questions that I have yet to properly understand.

As a reference, I've been reading this: http://eternallyconfuzzled.com/jsw_home.aspx

1) As I've read on the site above, a power of two or a prime number is recommended for the Hash Table size. This is basically an array and an array has a fixed size so I can quickly look up for the value I'm looking for. I can't declare a small array if I have a large input as it won't fit and I can't declare a very large array if my input data is not that large cause it's wasted memory.

What is the optimum size for the Hash Table? What should I base my decision on?

2) Also, on that site, there's a couple of hashing functions which I have yet to read them all. It also states that it's always best to use a good known algorithm and to roll my own. And I might do just that, I'll pick one from that site and test it out on my code and see if it minimizes collisions based on my input data.

What's bugging me is how I control the hash range? The hash can't return and integer larger than the Hash Table size or we'll have a serious problem. How do I deal with this?

+2  A: 

1) What you are referring to is the load factor of the hash table - the percentage of buckets that are expected to be filled. Wikipedia has this to say:

With a good hash function, the average lookup cost is nearly constant as the load factor increases from 0 up to 0.7 or so. Beyond that point, the probability of collisions and the cost of handling them increases.

I believe the Java implementation (and probably others) resizes periodically to keep the load factor within an acceptable range.

2) Just use the modulo operator (%) to keep the bucket index legal. The second operator should be the size of your bucket array.

danben
It still doesn't help me for the first question, I mean, I can't just place a random number as the hashtable size. How do I pick one? And what did you mean by "I don't know how to make that a 2."?
Nazgulled
A rather nice answer. OCaml Hashtables are similar to what you describe re automatic resizing. re "how I control the hash range" You don't. That's the beauty of it. You just use your runtime's hashtables as if they were the best thing in the world, and they just are (to a constant factor).
Pascal Cuoq
+1  A: 

Pick a small size for your hash table. As you add stuff to your table, check to see what percentage of the table is being used; when it is greater than 70% full, make the table bigger. This also holds true as you remove elements-- make the table smaller when it is less than 60% full, for instance. Wikipedia has a good description of some strategies for dynamic resizing, but that's the general idea.

I only say this because you seem to have known input data:

If you know the rough order of magnitude of the amount of data you will be storing in the hash table, it's generally good enough to just create a table about that big. (You shouldn't worry about whether everything will fit. Instead, the right thing to think about is how many collisions you will have and how you will handle them.)

As for the right hash function, it's possible that the structure of your input will suggest which one will be correct. For instance, what aspects of your input are likely to be evenly distributed?

Denise
A better way is count how many collisions have occurred as you've inserted elements into the has table. If that number starts to get too high (greater than about 0.8-1.0 collisions per entry on average), then you need to look at either increasing the hash table size or switching to a different hash function.
Chris Dodd