views: 66
answers: 3

Hello everyone,

I am writing a generic hash map in C++ which uses chaining to deal with collisions.

Say I have a hash map with 11 buckets and I insert 8 items. The hash function distributes them as follows:

bucket[0] = empty
bucket[1] = 2 elements
bucket[2] = empty
bucket[3] = 1 element
bucket[4] = 1 element
bucket[5] = 3 elements
bucket[6] = empty
bucket[7] = 1 element
bucket[8] = empty
bucket[9] = empty
bucket[10] = empty

Calculating the spread the easy way (occupied buckets divided by inserted items) gives 5/8 = 0.625. But how do I calculate the spread taking the depth of the buckets into account?
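To make "the easy way" concrete, here is a small sketch of that calculation (the function name and the hard-coded chain lengths are only for illustration):

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // The "easy" spread: number of occupied buckets divided by items inserted.
    double naive_spread(const std::vector<std::size_t>& chain_lengths, std::size_t item_count)
    {
        std::size_t occupied = 0;
        for (std::size_t len : chain_lengths)
            if (len > 0)
                ++occupied;
        return static_cast<double>(occupied) / static_cast<double>(item_count);
    }

    int main()
    {
        std::vector<std::size_t> chains = {0, 2, 0, 1, 1, 3, 0, 1, 0, 0, 0}; // 11 buckets, 8 items
        std::cout << naive_spread(chains, 8) << '\n'; // prints 0.625
    }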

I want to know this because of cases like the following: say I add 20 elements, and every bucket gets 1 element while the last bucket ends up with 11 elements.

Then the spread would be 1 if I calculate it the easy way, but this is obviously not correct! (The table resizes to avoid this, of course, but I still want to be able to report the spread.) I want to use this information to tune hash functions.

Thanks in advance!

+1  A: 

When I've worked on improving hash functions, I've used the sum of the squares of the chain lengths divided by the number of items inserted (and attempted to minimize the result). In your first example, you've inserted 8 items and the sum of the squares of the lengths is 16, so your "figure of merit" is 2.

In the second, you've inserted 20 items, and the sum of the squares is 130, so your figure of merit would be 6.5. I'd say the first was likely to be a better hash function in general (though I generally prefer to compare results from identical inputs).
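For what it's worth, a minimal sketch of that calculation (the function and parameter names are my own, not from your code):

    #include <cstddef>
    #include <vector>

    // Figure of merit: sum of squared chain lengths divided by items inserted.
    // Lower is better; long chains are penalized quadratically.
    double figure_of_merit(const std::vector<std::size_t>& chain_lengths, std::size_t item_count)
    {
        std::size_t sum_of_squares = 0;
        for (std::size_t len : chain_lengths)
            sum_of_squares += len * len;
        return static_cast<double>(sum_of_squares) / static_cast<double>(item_count);
    }
    // First example {0, 2, 0, 1, 1, 3, 0, 1, 0, 0, 0}: 16 / 8 == 2.0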

Jerry Coffin
Thanks, this makes sense and I will try it for comparing and improving the hash functions, but it does not give me a "percentage" of spread.
toefel
+2  A: 

If you're only using this to tune the hash functions themselves, you could compute a genuine measure of statistical dispersion, such as the Gini coefficient. On the other hand, if you're trying to make this a feature of the hash-map itself, I would recommend against it - computing a complicated benchmark as part of the 'is resize necessary' logic has its own performance costs; something naïve is probably better.
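As a rough sketch, the Gini coefficient over the chain lengths could be computed like this (using the standard sorted-values formula; all names here are illustrative):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Gini coefficient of the chain lengths: 0 means perfectly even,
    // values near 1 mean most items are piled into a few buckets.
    double gini(std::vector<std::size_t> chain_lengths)
    {
        std::sort(chain_lengths.begin(), chain_lengths.end());
        const std::size_t n = chain_lengths.size();
        const std::size_t total =
            std::accumulate(chain_lengths.begin(), chain_lengths.end(), std::size_t{0});
        if (n == 0 || total == 0)
            return 0.0; // empty table: treat as perfectly even

        double weighted = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            weighted += static_cast<double>(i + 1) * static_cast<double>(chain_lengths[i]);

        // G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n, for x sorted ascending
        return 2.0 * weighted / (static_cast<double>(n) * static_cast<double>(total))
             - (static_cast<double>(n) + 1.0) / static_cast<double>(n);
    }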

Ani
Thanks, I will look into it. I am using it just for tuning the hash functions; for resizing I am using a configurable load factor (say 0.80) and resize strategy (2.0).
toefel
+1  A: 

You probably care about the answer because you want to know how much work you're doing with chaining. Thus, you should probably instrument your hash map to output how much work it's doing (a few #ifdefs that increment a counter in the key methods will probably do the trick). You can then use the amount of work (number of compares, nodes followed, etc.) as a metric for your hash function, and as a bonus you get a nifty tool for performance tuning. Once you figure things out, you can remove the instrumentation.
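As a rough sketch, the counter might look something like this in the chain-lookup path (the macro name, counter, and node layout are placeholders, not assumptions about your actual code):

    #include <cstddef>

    #ifdef HASHMAP_PROFILE
    static std::size_t g_probe_count = 0; // nodes examined across all lookups
    #endif

    template <typename Node, typename Key>
    Node* find_in_chain(Node* head, const Key& key)
    {
        for (Node* node = head; node != nullptr; node = node->next) {
    #ifdef HASHMAP_PROFILE
            ++g_probe_count; // one node followed / key compared
    #endif
            if (node->key == key)
                return node;
        }
        return nullptr;
    }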

Rex Kerr