views:

164

answers:

2

I've written a custom hashing for my custom key in stdext::hash_map and would like to check whether the hasher is good. I'm using STL supplied with VS 2008. A typical check, as I know, is to check the uniformity of distribution among buckets.

How should I organize such a check correctly? A solution that comes to my mind is to modify STL sources to add a method to hash_map that walks through buckets and does the subject. Is there are any better ways?

Maybe, derive from hash_map and create there such method?

+3  A: 

Your best bet might be to just take your hashing algorithm to an array of ints and count the number of times that each hash bucket is hit, given real-world data. (I'm suggesting taking the STL out of the equation here, really.)

If you end up seeing high deviation in your counts with large sets of real-world data, your hashing algorithm is generating lots of collisions when there are plenty of empty (or emptier) buckets available.

Note that 'high deviation' is a relative term. A good hash algorithm is a deterministic random process and any random process has a chance of generating strange results, so test often, test well, and wherever possible, use your actual problem domain as a source of your tests and your controls.

Yes, that's what I'm going to do. But AFAIK I can't access buckets (and amount of items in them) outside of hash_map. Now I see two ways: modify STL sources or derive own class from hash_map with an appropriate method. I'd prefer the second solution. Are there any other ways?
flashnik
...which is why user30997 says in the firts paragraph to take STL out of the picture and just run your hash method over real data and incrementing counters.
vladr
flashnik
+2  A: 

I'd run one (large) dataset through stl::hash_map. Once done, I'd collect the results for all buckets using the following method

From hash_map:

size_type elems_in_bucket (size_type __n) const;

Finally, I would do compute the standard deviation (SD) of the elem-to-bucket distribution.

I'd do the above for different hash functions. Whichever hash function results in minimum SD is the winner (for this dataset).

ArunSaha
Yes, that would be perfect, but I don't have this members out of the box. SHould I define _HAS_TRADITIONAL_STL? Which side effects would it cause?
flashnik
I've found :) In my compiler's STL (MS VS 2008) this method is called `bucket_size`. Great thanks!
flashnik
@flashnik: You are welcome. Feel free to comment back if you have any follow-up question.
ArunSaha
@ArunSaha Thank you.
flashnik