http://msdn.microsoft.com/en-us/library/system.object.gethashcode(VS.80).aspx says:

For the best performance, a hash function must generate a random distribution for all input.

Does this actually have any effect on performance, or is it OK to use a function (like `return this.Id`) that doesn't give a "random distribution" but also doesn't cause more collisions?

+3  A: 

`return this.Id` would usually be fine (especially if `Id` is immutable and unique) - the main idea is to avoid collisions. However, think also about pending data - what is the `Id` of the 27 rows you haven't saved yet?

Also note that the GetHashCode and Equals implementations must agree.
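
To make that last point concrete, here's a minimal sketch, assuming a hypothetical `Customer` class whose `Id` is an immutable `int`:

    public class Customer
    {
        private readonly int id; // assumed immutable and unique

        public Customer(int id) { this.id = id; }

        public int Id { get { return id; } }

        // Hash code derived directly from the Id.
        public override int GetHashCode()
        {
            return this.Id;
        }

        // Must agree with GetHashCode: objects that compare equal
        // must return the same hash code.
        public override bool Equals(object obj)
        {
            Customer other = obj as Customer;
            return other != null && other.Id == this.Id;
        }
    }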

Marc Gravell
+1  A: 

Using `this.Id` is generally okay. The underlying idea is that you don't want too many values ending up in the same bucket. The bucket number is generally obtained by taking the hash code mod x, where x is the number of buckets in your hash table and is usually prime (or a probable prime).

If you're just using increasing IDs (1, 2, 3, 4...) that's going to end up being pretty random as far as the bucket distribution is concerned. It's only if your ID followed a pattern which might well give the same bucket number for lots of entries that you'd need to worry.
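
For illustration, a quick sketch of that bucketing (the bucket count of 7 is an arbitrary choice here; real hash tables pick their own):

    using System;

    class BucketDemo
    {
        static void Main()
        {
            int buckets = 7; // bucket counts are typically prime

            // Sequential Ids cycle evenly across all the buckets.
            for (int id = 1; id <= 10; id++)
            {
                Console.WriteLine("Id {0} -> bucket {1}", id, id % buckets);
            }
        }
    }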

Jon Skeet
A: 

Seems poorly worded... I think they mean hashcodes should be "uniformly distributed" over all possible int values (.net experts please correct me if I'm wrong), which would help minimize collisions.

Here's an illustration: Suppose all my hashcodes were in the range 1 through 10. If I were to use the hashcode to compute an array index where the array has length 100, then I can only get at most 10 different indices. This means my array is poorly utilized, and I would get a lot of collisions.
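
A quick sketch of that scenario, using the numbers from the example (hash codes in 1-10 against an array of length 100):

    using System;

    class SmallRangeDemo
    {
        static void Main()
        {
            int arrayLength = 100;

            // Hash codes confined to 1..10 reach only 10 of the
            // 100 available indices, no matter how many items there are.
            for (int hash = 1; hash <= 10; hash++)
            {
                Console.WriteLine("hash {0} -> index {1}", hash, hash % arrayLength);
            }
        }
    }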

Zach Scrivena
Having lots of hashcodes in the same small range would be a problem - but [1-100] (each number once) is just as good as 100 values spread evenly across all values.
Jon Skeet
I think that's true if you use the "low-order" bits as your indices (as in your answer). If I used the "high-order" bits, then I wouldn't do so well.
Zach Scrivena
In fact, it could be better... with a dense distribution, you are likely to get a reasonable spread between buckets. If you get unlucky with a sparse (but equally-spaced) distribution, you could get all the data in one bucket. Very much a corner case, though - and it fixes itself once the buckets resize.
Marc Gravell
@Marc: As in Jon's comments, I think this depends on how you convert hashcodes to indices.
Zach Scrivena
Note that int.GetHashCode() is "return this;", and many things are keyed by int...
Marc Gravell
@Marc: True, because there is no a priori distribution to speak of, so we can assume all ints are equally likely, i.e. uniformly distributed.
Zach Scrivena
A: 

It could have an effect on, e.g., hashtables that pick buckets based on the high bits of the hash (not common). Also, if your ids are, for example, all divisible by four, a hashtable that hashes into bucket `hash_code % buckets` would use only every fourth bucket whenever the bucket count is itself divisible by four (e.g. a power of two).
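
A sketch of that effect, contrasting a bucket count divisible by four with a prime one (the counts 16 and 17 are arbitrary choices for illustration):

    using System;
    using System.Linq;

    class DivisibleIdsDemo
    {
        static void Main()
        {
            // Ids that are all divisible by four: 4, 8, 12, ..., 400.
            var ids = Enumerable.Range(1, 100).Select(i => i * 4).ToArray();

            // With 16 buckets (also divisible by four), only every
            // fourth bucket is ever used: 0, 4, 8, 12.
            Console.WriteLine(string.Join(", ",
                ids.Select(id => id % 16).Distinct().OrderBy(b => b)));

            // With 17 buckets (prime), every bucket gets used.
            Console.WriteLine(ids.Select(id => id % 17).Distinct().Count()); // 17
        }
    }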

jpalecek
Note that int.GetHashCode() is "return this;", and many things are keyed by int...
Marc Gravell
A: 

I prefer to use

this.Id.GetHashCode();

I think this makes it more likely that hashes will be distributed properly as opposed to using Id directly.
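
Assuming `Id` is an `int`, the two expressions happen to be identical in practice, since `Int32.GetHashCode()` returns the value itself; a quick check:

    using System;

    class IdHashDemo
    {
        static void Main()
        {
            int id = 42;

            // Int32.GetHashCode() returns the value itself, so for an
            // int Id both expressions yield the same hash code.
            Console.WriteLine(id);               // 42
            Console.WriteLine(id.GetHashCode()); // 42
        }
    }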

dbkk
Well, it is likely that `Id` is an int - and int.GetHashCode() is `return this;`...
Marc Gravell
@Marc, sure, but we're not supposed to care about the implementation details of the id's type. So `id.GetHashCode()` and `id` are the same in practice, but different in principle.
dbkk