views:

205

answers:

4

According to various sources, such as Wikipedia and various .edu websites found by Google, the most common ways for a hash table to resolve collisions are linear or quadratic probing and chaining. Randomized probing is briefly mentioned but not given much attention. I've implemented a hash table that uses randomized probing to resolve collisions. Assuming there is a collision, resolution works as follows:

  1. The full (32-bit) hash of an object is used to seed a linear congruential random number generator.
  2. The generator generates 32-bit numbers and the modulus is taken to determine where in the hash table to probe next.

This has the very nice property that, regardless of how many hash collisions there are in modulus space, lookup and insertion times are expected to be O(1) as long as there are few collisions in full 32-bit hash space. Because the probe sequence is pseudo-random, no clustering behavior results from modulus space collisions, unlike with linear probing. Because the entire system is open-addressed and doesn't use linked lists anywhere, you don't need to perform a memory allocation on each insertion, unlike chaining.

Furthermore, because the size of the hash is usually the size of the address space (32 bits on 32-bit machines), it is simply impossible to fit enough items in address space to cause large numbers of hash collisions in full 32-bit hash space under a good hashing scheme.

Why, then, is randomized probing such an unpopular collision resolution strategy?

A: 

Wouldn't you have the problem that for insertions into a non-sparsely populated table there's no guarantee that you'll hit all the elements of the hash table before starting to iterate over duplicate elements?

As a result insertion time wouldn't be well defined.

Aaron
Given a good RNG, it should visit all buckets equally often.Insertion time isn't well defined either for linear and quadratic probing.
Thomas
You can bound the probability of that happening :)
Anna
+3  A: 

Possible reasons are that linear or quadratic probing

  • have the same worst-case time complexity (O(size of the table))
  • have the same best-case time complexity (O(1))
  • are easier to implement
  • are faster than a good RNG (since speed is a major selling point for hashtables)

But I'm not sure. Did you implement your own hashtable with another collision resolution and compare the two under different circumstances? That would be very enlightening.

Thomas
+5  A: 

One of the reasons for using linear lookup (such as double hasing) is cache locality. By making the second (rehash) function to be an addition of a small integer, most chances are that you'll hit the same cache line. It is very significant for large hashes.

Chain hashing is probably used due to its simplicity.

Anna
+1  A: 

Python's dictionary implementation does this. A very nice comment in dictobject.c says:

...
The first half of collision resolution is to visit table indices via this
recurrence:

    j = ((5*j) + 1) mod 2**i

For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof).
...

Sure looks like a linear congruential RNG to me!

Note that the full state of such an RNG is only i bits--has to be, to avoid revisiting entries--so you can't meaningfully use "[t]he full (32-bit) hash of an object" to seed the RNG. Python initially seeds j with i bits from the hash. If there is another collision, it grabs another 5 bits from the hash and throws those into the mix. (Read the rest of that comment, particularly where it talks about PERTURB_SHIFT.) It continues that way, adding more bits with each collision, until it has used up the whole hash code. This way Python uses a decent amount of whatever randomness the hash code offers, and the code is simple and fast.

This is some of the finest code I've ever read. It's featured in chapter 18 of Beautiful Code. So I'd say you're on to something!

Jason Orendorff
Interesting. My implementation simply doesn't guarantee that it will never visit the same slot more than once. I did some monte carlo simulations and concluded that, in practice, this happens too infrequently to worry about. Even if you allow for visiting the same slot more than once, you still get **expected** O(1) lookup time.
dsimcha