Chained Hash Tables vs. Open-Addressed Hash Tables

Wikipedia's article on hash tables gives a distinctly better explanation and overview of the different hash table schemes people have used than I can give off the top of my head. In fact, you're probably better off reading that article than asking the question here. :)

That said...

A chained hash table indexes into an array of pointers to the heads of linked lists. Each linked list cell has the key for which it was allocated and the value which was inserted for that key. When you want to look up a particular element from its key, the key's hash is used to work out which linked list to follow, and then that particular list is traversed to find the element that you're after. If more than one key in the hash table has the same hash, then you'll have linked lists with more than one element.

The downside of chained hashing is having to follow pointers in order to search linked lists. The upside is that chained hash tables only get linearly slower as the load factor (the ratio of elements in the hash table to the length of the bucket array) increases, even if it rises above 1.
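To make the chained scheme concrete, here's a minimal sketch in Python. The class and method names (ChainedHashTable, put, get, load_factor) are made up for illustration and aren't taken from any particular library, and a real implementation would also resize itself.

```python
class ChainedHashTable:
    """Minimal chained hash table: an array of buckets, each the head of a linked list."""

    class _Node:
        def __init__(self, key, value, next_node):
            self.key = key
            self.value = value
            self.next = next_node

    def __init__(self, num_buckets=8):
        self.buckets = [None] * num_buckets  # heads of the linked lists
        self.count = 0

    def _index(self, key):
        # The key's hash picks which linked list to follow.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        i = self._index(key)
        node = self.buckets[i]
        while node is not None:
            if node.key == key:          # key already present: overwrite
                node.value = value
                return
            node = node.next
        # Key not found: push a new cell onto the front of this list.
        self.buckets[i] = self._Node(key, value, self.buckets[i])
        self.count += 1

    def get(self, key, default=None):
        node = self.buckets[self._index(key)]
        while node is not None:          # traverse the chosen list
            if node.key == key:
                return node.value
            node = node.next
        return default

    def load_factor(self):
        return self.count / len(self.buckets)
```

Putting 10 keys into a table built with num_buckets=4 gives a load factor of 2.5, and lookups still work; the lists are just longer on average.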

An open-addressing hash table indexes into an array of pointers to pairs of (key, value). You use the key's hash value to work out which slot in the array to look at first. If that slot is already occupied by a different key (because more than one key maps to the same slot), then you use some scheme to decide on another slot to look in instead. For example, linear probing is where you look at the next slot after the one chosen, and then the next slot after that, and so on until you either find a slot that matches the key you're looking for, or you hit an empty slot (in which case the key must not be there).
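Here's a rough sketch of linear probing in Python, assuming a fixed-size table with no deletion and no resizing. The names (LinearProbingTable, put, get) are hypothetical.

```python
class LinearProbingTable:
    _EMPTY = object()  # sentinel marking a slot that has never been used

    def __init__(self, capacity=8):
        self.slots = [self._EMPTY] * capacity
        self.count = 0

    def _probe(self, key):
        """Yield slot indices starting at the key's home slot, wrapping around once."""
        n = len(self.slots)
        start = hash(key) % n
        for step in range(n):
            yield (start + step) % n

    def put(self, key, value):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is self._EMPTY:      # free slot: claim it
                self.slots[i] = (key, value)
                self.count += 1
                return
            if slot[0] == key:           # same key: overwrite
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key, default=None):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is self._EMPTY:      # hit an empty slot: key is absent
                return default
            if slot[0] == key:
                return slot[1]
        return default                   # scanned every slot without a match
```

With a capacity of 8, the integer keys 1 and 9 both start at slot 1, so whichever is inserted second ends up in slot 2.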

Open-addressing is usually faster than chained hashing when the load factor is low because you don't have to follow pointers between list nodes. It gets very, very slow if the load factor approaches 1, because you usually end up having to search through many of the slots in the bucket array before you find either the key that you were looking for or an empty slot. Also, you can never have more elements in the hash table than there are entries in the bucket array.
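If you want to see the slowdown for yourself, a small simulation helps. This is only a sketch: average_probes is a made-up helper, and it uses random slot indices as a stand-in for real hash values. Knuth's classic analysis puts the expected cost of an unsuccessful linear-probing search at roughly (1 + 1/(1 - a)^2) / 2 slots for load factor a, and the simulated numbers should climb in a similar way.

```python
import random

def average_probes(capacity, load_factor, seed=0, trials=2000):
    """Fill a linear-probing table to the given load factor, then return the
    mean number of slots examined when searching for a key that is absent."""
    if not 0 <= load_factor < 1:
        raise ValueError("load_factor must be in [0, 1) so an empty slot exists")
    rng = random.Random(seed)
    slots = [False] * capacity
    for _ in range(int(capacity * load_factor)):
        i = rng.randrange(capacity)      # stand-in for hash(key) % capacity
        while slots[i]:
            i = (i + 1) % capacity       # linear probing on collision
        slots[i] = True
    total = 0
    for _ in range(trials):
        i = rng.randrange(capacity)
        probes = 1
        while slots[i]:                  # keep going until an empty slot
            i = (i + 1) % capacity
            probes += 1
        total += probes
    return total / trials

if __name__ == "__main__":
    for lf in (0.3, 0.5, 0.7, 0.9, 0.95):
        print(lf, average_probes(1 << 14, lf))
```

The exact figures depend on the seed and table size, but the trend (cheap at low load, sharply worse near 1) should be easy to spot.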

To deal with the fact that all hash tables at least get slower (and in some cases actually break completely) when their load factor approaches 1, practical hash table implementations make the bucket array larger (by allocating a new bucket array, and copying elements from the old one into the new one, then freeing the old one) when the load factor gets above a certain value (typically about 0.7).
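A sketch of that grow-and-copy step, again in Python with linear probing. GrowingTable and MAX_LOAD are illustrative names, doubling is just one common growth policy, and the 0.7 threshold is simply the typical value mentioned above.

```python
class GrowingTable:
    MAX_LOAD = 0.7  # illustrative threshold; real libraries pick their own

    def __init__(self, capacity=8):
        self.slots = [None] * capacity   # None marks an empty slot
        self.count = 0

    def _find(self, slots, key):
        """Return the index holding key, or the first empty slot on its probe path.
        Terminates because the load factor is kept below 1."""
        i = hash(key) % len(slots)
        while slots[i] is not None and slots[i][0] != key:
            i = (i + 1) % len(slots)
        return i

    def _grow(self):
        # Allocate a bigger array and re-insert every entry; positions change
        # because hash(key) % len(slots) depends on the array length.
        new_slots = [None] * (2 * len(self.slots))
        for entry in self.slots:
            if entry is not None:
                new_slots[self._find(new_slots, entry[0])] = entry
        self.slots = new_slots           # the old array is garbage-collected

    def put(self, key, value):
        i = self._find(self.slots, key)
        if self.slots[i] is None:        # inserting a new key
            if (self.count + 1) / len(self.slots) > self.MAX_LOAD:
                self._grow()
                i = self._find(self.slots, key)
            self.count += 1
        self.slots[i] = (key, value)

    def get(self, key, default=None):
        entry = self.slots[self._find(self.slots, key)]
        return default if entry is None else entry[1]
```

Each resize costs time proportional to the number of entries, but because the array doubles, that cost is spread over many cheap insertions.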

There are lots of variations on all of the above. Again, please see the Wikipedia article; it really is quite good.

For a library that is meant to be used by other people, I would strongly recommend experimenting. Since hash tables are generally quite performance-critical, you're usually best off using somebody else's implementation which has already been carefully tuned. There are lots of open-source hash table implementations under BSD, LGPL and GPL licenses.

If you're working with GTK, for example, then you'll find that there's a good hash table in GLib.

Excellent explanation. One thing I've recently learned that most summaries neglect to point out is that deletions adversely affect performance in open addressing tables. When you delete, you only mark the entry as deleted. When inserting, you can re-use a deleted entry, but when searching, you cannot stop on a deleted entry. If you do lots of insertions and deletions, then over time you accumulate deleted entries that count against the load factor. Thus performance can degrade toward O(n), even if the actual load remains low. If you don't delete, open addressing is great.

Adrian McCarthy 2010-07-29 16:57:58
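
The deleted-entry effect described in the comment above can be sketched like this. TombstoneTable and probes_for_missing are hypothetical names, and the table is fixed-size to keep the example short; a real implementation would typically rebuild the table once too many deleted markers pile up.

```python
class TombstoneTable:
    _EMPTY = object()    # slot never used
    _DELETED = object()  # slot once held an entry that was removed

    def __init__(self, capacity=8):
        self.slots = [self._EMPTY] * capacity

    def _path(self, key):
        n = len(self.slots)
        start = hash(key) % n
        return ((start + step) % n for step in range(n))

    def put(self, key, value):
        reusable = None                  # first deleted slot seen on the path
        for i in self._path(key):
            slot = self.slots[i]
            if slot is self._EMPTY:
                self.slots[i if reusable is None else reusable] = (key, value)
                return
            if slot is self._DELETED:
                if reusable is None:
                    reusable = i
            elif slot[0] == key:
                self.slots[i] = (key, value)
                return
        if reusable is None:
            raise RuntimeError("table is full")
        self.slots[reusable] = (key, value)

    def get(self, key, default=None):
        for i in self._path(key):
            slot = self.slots[i]
            if slot is self._EMPTY:      # only a truly empty slot ends the search
                return default
            if slot is not self._DELETED and slot[0] == key:
                return slot[1]
        return default

    def delete(self, key):
        for i in self._path(key):
            slot = self.slots[i]
            if slot is self._EMPTY:
                return False
            if slot is not self._DELETED and slot[0] == key:
                self.slots[i] = self._DELETED   # mark, don't empty
                return True
        return False

    def probes_for_missing(self, key):
        """Count the slots a search examines for a key assumed to be absent."""
        count = 0
        for i in self._path(key):
            count += 1
            if self.slots[i] is self._EMPTY:
                break
        return count
```

After inserting and then deleting a run of keys, the table holds no live entries, yet a search that starts inside that run still has to walk past every deleted marker before it reaches an empty slot.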
