If I notice that a hash table (or any other data structure built on a hash table) is filling up, at what point should I build a new table with more buckets? And given n items in the table so far, how do I figure out how many buckets to use in the new one?

So let's say I have 100 buckets. Should I reorganize the table when there are 50 items in it? 500? 5000? Or should I look for the most-full bucket and key on that? Then, when I hit that point, how big do I make the new hash table?

Related to this, if you know beforehand roughly how many items will go in, is there a way to compute the number of buckets that gives good average performance?

I know the real answer depends on a lot of other considerations, like how important speed is versus size in a specific case, but I'm looking for general guidelines.

I also know that I shouldn't be optimizing this sort of thing unless good profiling has indicated that this is a bottleneck. I'm just thinking about a project that would use a lot of hash tables and wondered how to approach this.

+6  A: 

Generally, you look out for the load factor (informally, you already said that), which is formally defined as α = n / N, i.e. the ratio of stored items to total buckets. For a hash table to function properly (or at least to reason about its performance in mathematical terms), it should be α < 1.

Everything else is really up to empirical tests: if you see that your hash table doesn't perform well starting at α > 0.5, then be sure to stay under that value. This value also depends on your collision resolution technique: hashing with chaining may require different load factors than hashing with open addressing. Yet another factor is cache locality: if your table gets too big, it won't fit into main memory any more, and since your accesses into the array are essentially random, loads from memory may become a bottleneck.
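As a rough sketch of that check (my illustration, not from the answer; the names and the 0.5 threshold are placeholders taken from the example value above):

```python
def load_factor(num_items: int, num_buckets: int) -> float:
    """alpha = n / N, as defined above."""
    return num_items / num_buckets

def should_grow(num_items: int, num_buckets: int, threshold: float = 0.5) -> bool:
    # 0.5 is only the example value from the text; the right threshold depends on
    # your collision resolution technique and has to be found empirically.
    return load_factor(num_items, num_buckets) > threshold
```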

Konrad Rudolph
+3  A: 

A good rule of thumb (not always ideal, just a rule of thumb) is to re-hash when the hashtable is filled up to 80%. That means if you have 100 buckets and 80 items inside, regardless of how many collisions you had before, it's time to increase capacity.

How much should you increase it? Well, there is no perfect value there either. The simplest solution is to double the capacity on each increase, so it goes to 200, 400, 800, and so on. If you think this is too much (after all, it will jump from 8 MB of memory to 16 MB when the hashtable gets really large, and you may never fill the 16 MB), choose a smaller growth factor. A growth of at least 1/3 is recommended (going from 100 to 133), I'd say; maybe let it grow by 50% each time as a compromise.

Note that all of this also depends on how collisions are handled. A simple way to handle them (my personal favorite) is to store colliding items in a linked list. If 3 items hash to the same bucket, there are still only up to 3 comparisons to find one of them. Since linked lists are very inefficient to search, you may want to increase capacity earlier, e.g. when 60% of the capacity is used, to keep the hashtable fast.

OTOH, you can do something more sophisticated and keep stats about the number of collisions: as long as you hardly have any (because you have a very good hash function), there is no need to re-hash at all, even if 99% of the capacity is in use. Also, if you handle collisions in a sophisticated way (e.g. each bucket is itself a sorted table and you can binary-search within it), your lookups might still be fast enough even when the table is loaded to 200% (twice as many items as capacity). In that case you could keep stats on how big the largest sorted bucket gets, and when it exceeds, say, 8 entries, decide that this is getting too slow and re-hash.

Re-hashing is very slow, so it should happen as rarely as possible. Thus, when you do need to re-hash, don't grow the capacity by too little, or you will have to re-hash again soon after adding more items. When you re-hash, make the new capacity significantly larger than the number of items currently in the table; anything less is too little capacity.
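To make the above concrete, here is a minimal sketch (my illustration, not Mecki's code) of a chained hashtable that re-hashes at 80% load and doubles its capacity; deletion and error handling are left out for brevity:

```python
class ChainedHashTable:
    def __init__(self, capacity=16, max_load=0.8):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0
        self.max_load = max_load

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value                      # key already present: overwrite
                return
        bucket.append([key, value])
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._rehash(len(self.buckets) * 2)       # doubling, as discussed above

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _rehash(self, new_capacity):
        old_entries = [e for bucket in self.buckets for e in bucket]
        self.buckets = [[] for _ in range(new_capacity)]
        for key, value in old_entries:
            self.buckets[self._index(key)].append([key, value])
```

Replacing the doubling with e.g. `len(self.buckets) * 3 // 2` gives the 50% compromise growth mentioned above.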

Mecki
A: 

Depends on the type of hash table you are building. If you are using a fixed, array-based hash table (as opposed to linked lists for buckets), you should resize the array either when the table is full or when you have hit a maximum probe count (depending on whether you care more about speed or memory). If you are using linked lists, memory isn't as much of a concern and you don't have to probe for empty slots, so resizing isn't as big of a deal.

The key with hash tables is the hashing algorithm, not the number of buckets. Ideally you want at most one item in each bucket, so you should resize when the number of items in the hash table equals the number of buckets. If your data isn't evenly distributed, you are better off with a better hash algorithm than a better resize strategy.

jezell
+2  A: 

There are typically two types of hashtables: open and closed (also known as separate chaining and open addressing, respectively).

In an open hashtable you find the right bucket based on the hash, and then build a list of items hanging off that bucket.

In a closed hashtable you find the initial bucket using the hash value, and if it is occupied you probe for the next one. In the simplest case you can do this by looking for the next free bucket, or you can create a second hash value from your item and step by that (though you must ensure that the step is relatively prime to the hash table's size so that you will visit all the buckets).
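As a sketch of that double-hashing probe (my illustration; it assumes the table size is prime, so any non-zero step is relatively prime to it and the sequence visits every bucket exactly once):

```python
def probe_sequence(key, table_size):
    """Yield the bucket indices to try: start at h1, step by h2."""
    h1 = hash(key) % table_size
    h2 = 1 + (hash(key) // table_size) % (table_size - 1)   # step in [1, table_size - 1]
    for i in range(table_size):
        yield (h1 + i * h2) % table_size
```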

An open hashtable typically isn't resized. You set the initial size to be what you feel is reasonable for the problem. As others have pointed out you could resize on an open hashtable, but reasoning about the performance of this data structure now becomes very hard. If you resize when the length of a given bucket is L then you might end up resizing on just L items in the whole hashtable, which is very inefficient.

A closed hashtable is resized when the load factor (no. of items in the hashtable / no. of buckets) hits some predefined value. I tend to use 80%, but the exact value is unlikely to be too critical.

The benefit of a closed hashtable is that the amortized cost of inserting an item is always O(1) (assuming a good hash function). Inserting a particular item might be O(N) due to the cost of resizing, but that is done very infrequently.
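A quick back-of-the-envelope check of that amortized claim (my illustration, assuming the table starts at capacity 1 and doubles whenever it is full): the copies performed by all resizes form a geometric series 1 + 2 + 4 + ... < 2n, so the average work per insert stays constant.

```python
# Count the elements copied by all doubling resizes while inserting n items.
n = 1_000_000
capacity, copies = 1, 0
for items in range(1, n + 1):
    if items > capacity:          # table full: allocate double the space and copy over
        copies += capacity
        capacity *= 2
print(copies / n)                 # about 1.05 copies per insert on average, i.e. O(1) amortized
```

Resizing at 80% load instead of 100% only changes the constant, not the conclusion.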

Rob Walker
A: 

If you use Linear Hashing, the table itself automatically takes care of resizing, by maintaining a constant load factor.
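For reference, here is a minimal linear-hashing sketch (my illustration of the technique, not George's code or any particular library's): one bucket is split per overflowing insert, so the load factor stays roughly constant without any big one-off re-hash.

```python
class LinearHashTable:
    def __init__(self, initial_buckets=4, max_load=0.8):
        self.n0 = initial_buckets          # bucket count at level 0
        self.level = 0                     # completed doubling rounds
        self.split = 0                     # next bucket due to be split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0
        self.max_load = max_load           # assumed threshold; tune to taste

    def _index(self, key):
        h = hash(key)
        i = h % (self.n0 * 2 ** self.level)
        if i < self.split:                 # bucket already split this round: use finer hash
            i = h % (self.n0 * 2 ** (self.level + 1))
        return i

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value
                return
        bucket.append([key, value])
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def get(self, key, default=None):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _split_one(self):
        # Add one bucket and redistribute only the entries of the bucket being split.
        self.buckets.append([])
        victim, self.buckets[self.split] = self.buckets[self.split], []
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:   # finished a doubling round
            self.level += 1
            self.split = 0
        for key, value in victim:
            self.buckets[self._index(key)].append([key, value])
```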

George V. Reilly