views: 85
answers: 2

I have read through various papers on the 'Balls and Bins' problem, and it seems that if a hash function is working right (i.e. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):

  1. Probability that a bin is empty, for large n is 1/e.
  2. Expected number of empty bins is n/e.
  3. Probability that a bin has exactly k balls is <= 1/(e*k!) (corrected).
  4. Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).

These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
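
For example, the simple properties above are the kind of thing I can check directly with a quick simulation. A rough Python sketch (uniform random slot picks stand in for an ideal hash function, and n = 100,000 is just an illustrative size):

    import math
    import random
    from collections import Counter

    n = 100_000                      # illustrative size; larger n gets closer to the limits
    rng = random.Random(0)

    loads = [0] * n                  # throw n balls into n bins, uniformly at random
    for _ in range(n):
        loads[rng.randrange(n)] += 1

    counts = Counter(loads)          # counts[k] = number of bins holding exactly k balls

    print("fraction of empty bins:", counts[0] / n, " vs 1/e =", 1 / math.e)
    for k in range(6):
        print(f"Pr[exactly {k}] ~ {counts[k] / n:.5f}  vs  1/(e*{k}!) = "
              f"{1 / (math.e * math.factorial(k)):.5f}")
    print("observed max-load:", max(loads))

The observed max-load from a run like this is the quantity the max-load formulae below are meant to bound.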

Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ). Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them - or state that log means log base e and then use ln elsewhere.

Is ln the log to base e or base 2? Is this max-load formula right, and how big should n be to run a test?

This lecture seems to cover it best, but I am no mathematician.

http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf

BTW, with high probability seems to mean 1 - 1/n.

A: 

That is a fascinating paper/lecture - it makes me wish I had taken a formal algorithms class.

I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.

First, the base of the logs. These numbers are given in big-O notation, not as absolute formulae. That means you're looking for something 'on the order of ln(n) / ln(ln(n))': not an exact answer, but the expectation that, as n gets bigger, the relationship between n and the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what a 'good' curve looks like, except that it should follow that big-O relationship). The two formulae you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, tied to a particular implementation; a less efficient implementation would have a bigger constant.

With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.

Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.

Then increase N to, say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory put forth in that lecture.
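
A rough sketch of that experiment (Python; the max_load helper below uses random slot picks in place of your real hash function on realistic keys, and the trial counts and N values are arbitrary choices):

    import math
    import random
    import statistics

    def max_load(n, rng):
        """One trial: throw n keys into n bins and return the fullest bin's count."""
        loads = [0] * n
        for _ in range(n):
            loads[rng.randrange(n)] += 1
        return max(loads)

    rng = random.Random(42)
    for n in (100, 1_000, 10_000, 100_000):              # increase N logarithmically
        trials = [max_load(n, rng) for _ in range(30)]   # 30+ runs per N, as suggested
        mean = statistics.mean(trials)
        sem = statistics.stdev(trials) / math.sqrt(len(trials))
        predicted = math.log(n) / math.log(math.log(n))  # ln(n)/ln(ln(n)), natural logs
        print(f"N={n:>6}  mean max-load = {mean:.2f} +/- {sem:.2f}  "
              f"ln(n)/ln(ln(n)) = {predicted:.2f}  ratio = {mean / predicted:.2f}")

If the ratio in the last column settles down to something roughly constant as N grows, you're seeing the big-O behaviour; the constant itself depends on the implementation and the model.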

This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that it starts to collapse. Why could that be? Maybe you have some 32-bit limitation in your code without realizing it (i.e., using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then, as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.

EDIT: a proof of why the base of the log function doesn't matter with big-O notation:

log N = log_10(N) = log_b(N) / log_b(10) = (1/log_b(10)) * log_b(N)

1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
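
A quick numeric check of the same fact (Python; bases 2 and 10 are just examples):

    import math

    for n in (10**3, 10**6, 10**9):
        print(f"N = 10^{round(math.log10(n))}:  log2(N)/ln(N) = {math.log2(n) / math.log(n):.6f}")
    print("1/ln(2) =", 1 / math.log(2))   # the ratio is this constant, independent of N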

mmr
Thanks for your effort. Given 'purely' random input, I was looking to verify a hash function by comparing its performance with some theoretical result. Since Balls in Bins offers simple probabilities for easily measured values, I was expecting to be able to verify my hash function easily. But then the max-load 'order-of' result was presented; the one with the `3` looked promising - but is it `log2` or `loge`? (I am thinking base e, w.h.p. :)
philcolbourn
Maybe it is not possible to quantify this value, but the way the paper presented it seemed to give hope. I do take your idea to plot max-load behaviour to see if I am within a constant factor, but even with a large table of say 65k slots, the max-load w.h.p could be 4 - so the constant factor is important.
philcolbourn
Also, in reality you would not aim to fill a hash table of size N with N hashes, but this set point seems to allow any hash function to be tested, which would be nice and would keep hash function performance arguments in check. For me, being able to say that a hash function behaves correctly is worth a lot more than having someone say that "this hash function works well for long text strings".
philcolbourn
@philcolbourn: My point (which, as I reread what I wrote, I wasn't clear on) is that, from a theoretical standpoint, the constant and the base of the log are unimportant. Big-O notation lets you finagle those around. Your particular implementation will have a certain constant and base, and modifications to your implementation should change that constant and base. If you expect your users to have hash tables with fewer than 1000 elements, that would (I assume) be a different implementation than one with 100 million, or at least some parameters would have to change.
mmr
So, theoretically, if max load is 3, 4, or 10, it doesn't matter. The particular hash table scheme they presented has a value of 3, base e (if I'm reading it right). If you've implemented what they suggest, you can test to see whether or not you actually get 3, base e, through empirical datasets.
mmr
I'm convinced that an accurate figure can be determined for a given hash table and a good (random-like) hash function. I understand the `O(n)` ideas and they are valuable. Having read lots more and built a simulator: if I decide that `w.h.p.` means `1e-4`, then I can find `Pr[balls>=x]~1e-4`. Now I have my `max-load` target `x` to test against - this seems to work OK.
philcolbourn
A: 

After some more research and trial-and-error, I think I can provide something partway towards an answer.

  1. To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.

  2. max-load can be defined for any probability you like. The typical formula used is

    1-1/n**c

Most papers on the topic use

1-1/n

An example might be easiest.

Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.

The max-load is the largest number of items that hash to the same value - i.e. collisions (assuming that your hash function is good).

Using the formula for the probability of getting exactly k identical hash values (item 3 in my question, the Poisson approximation)

Pr[ exactly k ] = 1/(e*k!)

then accumulating the probabilities for exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.

eg.

Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003     // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007

So, in this case, the max-load is 5.
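
That accumulation is easy to script. A short sketch (Python; the 0.999 target comes from the 1-1/1000 example above - strictly, it is the probability for one particular slot, so the fullest of all 1000 slots can occasionally exceed the target):

    import math

    target = 0.999        # 1 - 1/n for n = 1000
    cumulative = 0.0
    k = 0
    while cumulative < target:
        p_k = 1 / (math.e * math.factorial(k))   # Pr[a given slot holds exactly k] ~ 1/(e*k!)
        cumulative += p_k
        print(f"Pr[{k}] = {p_k:.5f}   cumulative = {cumulative:.5f}")
        k += 1
    print("max-load target:", k - 1)             # prints 5 for target = 0.999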

So if my hash function is working well on my set of data then I should expect the maximum number of identical hash values (or collisions) to be 5.

If it isn't, then this could be due to one of the following reasons:

  1. Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values. (There are ways around this; for example, you could use multiple hash functions, but that slows down hashing and I don't know much about it.)

  2. Your hash function doesn't work well with your data - try it with random data.

  3. Your hash function doesn't work well.

The other tests I mentioned in my question are also helpful for checking that your hash function is behaving as expected.

Incidentally, my hash function worked nicely - except on short (1..4 character) strings.

I also implemented a simple split-table version which places each hash value into the least-used slot from a choice of 2 locations. This more than halves the number of collisions, but it means that adding to and searching the hash table is a little slower.
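
For anyone curious, here is a rough simulation of that 'least used of two slots' idea (Python; random picks stand in for two independent hash functions, and the 65,536-slot table size is just an example):

    import random

    n = 65_536
    rng = random.Random(0)

    single = [0] * n     # classic: each key goes to one random slot
    double = [0] * n     # two-choice: each key goes to the emptier of two random slots
    for _ in range(n):
        single[rng.randrange(n)] += 1
        a, b = rng.randrange(n), rng.randrange(n)
        double[a if double[a] <= double[b] else b] += 1

    print("max load, one choice :", max(single))   # grows like ln(n)/ln(ln(n))
    print("max load, two choices:", max(double))   # grows like ln(ln(n)), noticeably smaller

Searching has to probe both candidate slots, which matches the 'a little slower' observation.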

I hope this helps.

philcolbourn