ansaurus

Question

Why isn't randomized probing more popular in hash table implementations?

Answer 1

A:

Wouldn't you have the problem that for insertions into a non-sparsely populated table there's no guarantee that you'll hit all the elements of the hash table before starting to iterate over duplicate elements?

As a result insertion time wouldn't be well defined.

Aaron 2009-11-10 18:23:14

Given a good RNG, it should visit all buckets equally often.Insertion time isn't well defined either for linear and quadratic probing.

Thomas 2009-11-10 18:25:57

You can bound the probability of that happening :)

Anna 2009-11-10 18:27:25

Answer 2

+3 A:

Possible reasons are that linear or quadratic probing

have the same worst-case time complexity (O(size of the table))
have the same best-case time complexity (O(1))
are easier to implement
are faster than a good RNG (since speed is a major selling point for hashtables)

But I'm not sure. Did you implement your own hashtable with another collision resolution and compare the two under different circumstances? That would be very enlightening.

Thomas 2009-11-10 18:30:55

Answer 3

+5 A:

One of the reasons for using linear lookup (such as double hasing) is cache locality. By making the second (rehash) function to be an addition of a small integer, most chances are that you'll hit the same cache line. It is very significant for large hashes.

Chain hashing is probably used due to its simplicity.

Anna 2009-11-10 18:32:42

Answer 4

+1 A:

Python's dictionary implementation does this. A very nice comment in dictobject.c says:

...
The first half of collision resolution is to visit table indices via this
recurrence:

    j = ((5*j) + 1) mod 2**i

For any initial j in range(2**i), repeating that 2**i times generates each
int in range(2**i) exactly once (see any text on random-number generation for
proof).
...

Sure looks like a linear congruential RNG to me!

Note that the full state of such an RNG is only i bits--has to be, to avoid revisiting entries--so you can't meaningfully use "[t]he full (32-bit) hash of an object" to seed the RNG. Python initially seeds j with i bits from the hash. If there is another collision, it grabs another 5 bits from the hash and throws those into the mix. (Read the rest of that comment, particularly where it talks about PERTURB_SHIFT.) It continues that way, adding more bits with each collision, until it has used up the whole hash code. This way Python uses a decent amount of whatever randomness the hash code offers, and the code is simple and fast.

This is some of the finest code I've ever read. It's featured in chapter 18 of Beautiful Code. So I'd say you're on to something!

Jason Orendorff 2009-11-19 22:16:33

Interesting. My implementation simply doesn't guarantee that it will never visit the same slot more than once. I did some monte carlo simulations and concluded that, in practice, this happens too infrequently to worry about. Even if you allow for visiting the same slot more than once, you still get **expected** O(1) lookup time.

dsimcha 2009-11-19 22:49:16

ansaurus

tags:

views:

answers:

Why isn't randomized probing more popular in hash table implementations?

related questions