views:

137

answers:

3

I've been puzzling over this for a few days... feel free to shoot down any of my assumptions.

We're using a Dictionary with integer keys. I assume that the value of the key in this case is used directly as the hash. Does this mean (if the keys are grouped over a small range) that the distribution of the key hash (same as the key itself, right?) will be in a similarly small range, and therefore a bad choice for a hashtable?

Would it be better to provide an IEqualityComparer that did something clever with primes and modulo mathematics to calculate a better distributed hash?

+4  A: 

It's not used directly in that the dictionary will still ask the key for its hash - but the hash value of an Int32 is just the value, so the thrust of your question is relevant, yes.

I believe that the way the .NET dictionary works doesn't rely on hash values being uniformly distributed. It takes hash % bucketCount where bucketCount is always prime. (That's from memory though - I could be wrong.)

You could still end up with an inefficient set of keys of course, if they happen to be spaced by the bucket count. That will always be the case though - a hash table would only ever be genuinely O(1) for all keys if they had unique hash values and the table maintained a set of buckets for every possible hash :) In reality it tends not to be a problem. If you happen to know that it will be a problem, then yes, a custom IEqualityComparer<T> could help.

Jon Skeet
Just checked with reflector... you are right with hash % bucketcount. Knowing that, it all falls into place. Thanks Jon.
spender
A: 

Assuming you're using a standard library hash table implementation, chances are the key is not the hash, even if the key is an integer, for exactly the reason that you point out.

So while your logic regarding hash distributions is correct, your initial assumption that integer keys would mean that hashes = keys is probably not.

If I'm wrong re: .NET then oh well; this is more of a generalized answer. :)

Amber
I think it's fairly common for the hash of a numeric type to just be the value, assuming it fits into the hash range.
Jon Skeet
A potential issue that you run into that though is with patterns in sequences of numbers - if you get unlucky with the pattern width being a multiple of your bucketCount, you run into issues.
Amber
Exactly as my post mentions... but *any* hash algorithm can wind up with that problem if you're unlucky.
Jon Skeet
Sure. However simply due to the nature of data patterns tend to occur far more in standard input than in the output of a hash function specifically designed to mix things up a bit.
Amber
A: 

Before doing something clever I'd test the speed of it as-is, and see if it's suitable for you. If it isn't, then try the clever thing. But I would expect it's better to leave it alone; it's more important that the hashes don't collide, and as long as that's happening, life will be fine.

Noon Silk