views:

897

answers:

9

According to MSDN, a hash function must have the following properties:

  1. If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

  2. The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

  3. For the best performance, a hash function must generate a random distribution for all input.


I keep finding myself in the following scenario: I have created a class, implemented IEquatable<T> and overridden object.Equals(object). MSDN states that:

Types that override Equals must also override GetHashCode ; otherwise, Hashtable might not work correctly.

And then it usually stops up a bit for me. Because, how do you properly override object.GetHashCode()? Never really know where to start, and it seems to be a lot of pitfalls.

Here at StackOverflow, there are quite a few questions related to GetHashCode overriding, but most of them seems to be on quite particular cases and specific issues. So, therefore I would like to get a good compilation here. An overview with general advice and guidelines. What to do, what not to do, common pitfalls, where to start, etc.

I would like it to be especially directed at C#, but I would think it will work kind of the same way for other .NET languages as well(?).


I think maybe the best way is to create one answer per topic with a quick and short answer first (close to one-liner if at all possible), then maybe some more information and end with related questions, discussions, blog posts, etc., if there are any. I can then create one post as the accepted answer (to get it on top) with just a "table of contents". Try to keep it short and concise. And don't just link to other questions and blog posts. Try to take the essence of them and then rather link to source (especially since the source could disappear. Also, please try to edit and improve answers instead of created lots of very similar ones.

I am not a very good technical writer, but I will at least try to format answers so they look alike, create the table of contents, etc. I will also try to search up some of the related questions here at SO that answers parts of these and maybe pull out the essence of the ones I can manage. But since I am not very stable on this topic, I will try to stay away for the most part :p

A: 

When do I override object.GetHashCode()?

As MSDN states:

Types that override Equals must also override GetHashCode ; otherwise, Hashtable might not work correctly.

Related links:

Svish
+1  A: 

You should override it whenever you have a meaningful measure of equality for objects of that type (i.e. you override Equals). If you knew the object wasn't going to be hashed for any reason you could leave it, but it's unlikely you could know this in advance.

The hash should be based only on the properties of the object that are used to define equality since two objects that are considered equal should have the same hash code. In general you would usually do something like:


public override int GetHashCode()
{
    int mc = //magic constant, usually some prime
    return mc * prop1.GetHashCode() * prop2.GetHashCode * ... * propN.GetHashCode();
}

I usually assume multiplying the values together will produce a fairly uniform distribution, assuming each property's hashcode function does the same, although this may well be wrong. Using this method, if the objects equality-defining properties change, then the hash code is also likely to change, which is acceptable given definition #2 in your question. It also deals with all types in a uniform way.

You could return the same value for all instances, although this will make any algorithms that use hashing (such as dictionarys) very slow - essentially all instances will be hashed to the same bucket and lookup will then become O(n) instead of the expected O(1). This of course negates any benefits of using such structures for lookup.

Lee
I might be missing something, but if you have the same values in different properties? E.g. when objA.prop1 == objB.prop2 and objA.prop2 == objB.prop1 but objA.prop1 != objA.prop2.You could concatenate the string values and then hash that.
Eric Nicholson
@Eric - yes you could do that, although there's nothing wrong with the approach above since while objA and objB would have the same hash code, there's no need to ensure objects with the same hash code are equal, just the other way round. Your approach would could lead to better performance for a HashTable for example, and probably would be better for the case you gave.
Lee
+1  A: 

Why do I have to override object.GetHashCode()?

Overriding this method is important because the following property must always remain true:

If two objects compare as equal, the GetHashCode method for each object must return the same value.

The reason, as stated by JaredPar in a blog post on implementing equality, is that

Many classes use the hash code to classify an object. In particular hash tables and dictionaries tend to place objects in buckets based on their hash code. When checking if an object is already in the hash table it will first look for it in a bucket. If two objects are equal but have different hash codes they may be put into different buckets and the dictionary would fail to lookup the object.

Related links:

Svish
+2  A: 

Table of contents

(Will mark this as the answer to get it on top, when I am allowed to (48 hours)


Things that I would like to be covered, but haven't been yet:

  • How to create the integer (How to "convert" an object into an int wasn't very obvious to me anyways).
  • What fields to base the hash code upon.
    • If it should only be on immutable fields, what if there are only mutable ones?
  • How to generate a good random distribution. (MSDN Property #3)
    • Part to this, seems to choose a good magic prime number (have seen 17, 23 and 397 been used), but how do you choose it, and what is it for exactly?
  • How to make sure the hash code stays the same all through the object lifetime. (MSDN Property #2)
    • Especially when the equality is based upon mutable fields. (MSDN Property #1)
  • How to deal with fields that are complex types (not among the built-in C# types).
    • Complex objects and structs, arrays, collections, lists, dictionaries, generic types, etc.
    • For example, even though the list or dictionary might be readonly, that doesn't mean the contents of it are.
  • How to deal with inherited classes.
    • Should you somehow incorporate base.GetHashCode() into your hash code?
  • Could you technically just be lazy and return 0? Would heavily break MSDN guideline number #3, but would at least make sure #1 and #2 were always true :P
  • Common pitfalls and gotchas.
Svish
+1  A: 

What are those magic numbers often seen in GetHashCode implementations?

They are prime numbers. Prime numbers are used for creating hash codes because prime number maximize the usage of the hash code space.

Specifically, start with the small prime number 3, and consider only the low-order nybbles of the results:

  • 3 * 1 = 3 = 3(mod 8) = 0011
  • 3 * 2 = 6 = 6(mod 8) = 1010
  • 3 * 3 = 9 = 1(mod 8) = 0001
  • 3 * 4 = 12 = 4(mod 8) = 1000
  • 3 * 5 = 15 = 7(mod 8) = 1111
  • 3 * 6 = 18 = 2(mod 8) = 0010
  • 3 * 7 = 21 = 5(mod 8) = 1001
  • 3 * 8 = 24 = 0(mod 8) = 0000
  • 3 * 9 = 27 = 3(mod 8) = 0011

And we start over. But you'll notice that successive multiples of our prime generated every possible permutation of bits in our nybble before starting to repeat. We can get the same effect with any prime number and any number of bits, which makes prime numbers optimal for generating near-random hash codes. The reason we usually see larger primes instead of small primes like 3 in the example above is that, for greater numbers of bits in our hash code, the results obtained from using a small prime are not even pseudo-random - they're simply an increasing sequence until an overflow is encountered. For optimal randomness, a prime number that results in overflow for fairly small coefficients should be used, unless you can guarantee that your coefficients will not be small.

Related links:

Svish
A: 
public override int GetHashCode()
{
    return IntProp1 ^ IntProp2 ^ StrProp3.GetHashCode() ^ StrProp4.GetHashCode ^ CustomClassProp.GetHashCode;
}

Do the same in the customClass's GetHasCode method. Works like a charm.

Burnsys
@Burnsys: Why? What's the theory behind it? How can you be sure that it "works like a charm"?
Johann Gerell
A: 

What fields to base the hash code upon? If it should only be on immutable fields, what if there are only mutable ones?

It doesn't need to be based only on immutable fields. I would base it on the fields that determine the outcome of the equals method.

Matthijs Wessels
It is a good idea to base it on immutable fields for the reason mentioned by Svish:"you "lose" an object if the object used as a key suddenly had its hash code changed". But I think it depends on your purpose for the object.
Matthijs Wessels
A: 

How to make sure the hash code stays the same all through the object lifetime. (MSDN Property #2) Especially when the equality is based upon mutable fields. (MSDN Property #1)

You seem to misunderstand Property #2. The hashcode doesn't need to stay the same thoughout the objects lifetime. It just needs to stay the same as long as the values that determine the outcome of the equals method are not changed. So logically, you base the hashcode on those values only. Then there shouldn't be a problem.

Matthijs Wessels
Hm... not sure if I follow. Wouldn't you "lose" an object if the object used as a key suddenly had its hash code changed?
Svish
Well I think you shouldn't change the object that is used as a key. So in that sense, it is good to base the hash on immutable fields. But property #2 doesn't state that. It is not a requirement.
Matthijs Wessels
It's a requirement in all but name. It's true that a stable hashcode is only a requirement of the hashtable (and other collection) implementations, and not of the interface itself, but that's a pretty strong requirement. Requiring knowlege that either no references to your object are stored in a hashtable or that the property you are changing doesn't change the hash seems like a pretty onerous one (or a blatant disregard for encapsulation). So a immutable hashtable is a practical requirement, even if not strictly speaking required by GetHashCode doco.
piers7
I think you just shouldn't use such an object in a Hashtable. If the object changes in such a way that the equals method changes, then you don't expect it to get the same hash. Say you have a class `Point` with two fields `X` and `Y`. Say you don't make this class immutable for performance or whatever. So `X` and `Y` have public setters. Now you must base the hash on mutable fields.
Matthijs Wessels
A: 

A) You must override both Equals and GetHashCode if you want to employ value equality instead of the default reference equality. With the later, two object references compare as equal if they both refer to the same object instance. With the former they compare as equal if their value is the same even if they refer to different objects. For example, you probably want to employ value equality for Date, Money, and Point objects.

B) In order to implement value equality you must override Equals and GetHashCode. Both should depend on the fields of the object that encapsulate the value. For example, Date.Year, Date.Month and Date.Day; or Money.Currency and Money.Amount; or Point.X, Point.Y and Point.Z. You should also consider overriding operator ==, operator !=, operator <, and operator >.

C) The hashcode doesn't have to stay constant all through the object lifetime. However it must remain immutable while it participates as the key in a hash. From MSDN doco for Dictionary: "As long as an object is used as a key in the Dictionary<(Of <(TKey, TValue>)>), it must not change in any way that affects its hash value." If you must change the value of a key remove the entry from the dictionary, change the key value, and replace the entry.

D) IMO, you will simplify your life if your value objects are themselves immutable.