I have a problem with a custom object that needs to be keyed for a table. I need to generate a unique numeric key. I'm having collision problems and I'm wondering if I can leverage a dictionary to help me. Assume I have an object like this:

class Thingy
{
    public string Foo;
    public string Bar;
    public string Others;
}

and so on with more fields. Let's say Foo and Bar are my key fields: if they're equal between two Thingys, then the two objects should be considered equal (one may represent an update to the other, with the Others fields being updated). So I have these:

public override bool Equals(object obj)
{
    Thingy thing = (Thingy)obj; // yes I do type check first
    return (this.Foo == thing.Foo && this.Bar == thing.Bar);
}

public override int GetHashCode()
{
    return (this.Foo + this.Bar).GetHashCode(); // using default string impl
}

So this works for the most part, but there are rare occasions where two Thingys that are actually different have the same hash code.

My question is this: could I use a Dictionary<Thingy, int> where I put in my Thingys, and use a sequential value coming out of the dictionary as my actual key? I'm wondering if the Dictionary, when detecting a rare hash code collision, will call my Equals method, determine that the objects are actually different, and store them separately. I imagine that when looking one up, it would find the bucket for that hash and search it for the correct Thingy, again using Equals for comparison.

Is this the case with dictionary, or does it only resolve collisions where the hash code is different, but (hash % size) is the same? If this won't work, what might?

+8  A: 

Hash collisions only affect performance, not integrity.

A simple test would be to change GetHashCode() to simply return 1;. You'll note that the dictionary still behaves properly, but with any reasonable dataset, it will perform terribly.
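A minimal sketch of that test, assuming a hypothetical CollidingThingy class that stands in for the question's Thingy (the class and method names here are made up for illustration). Every instance returns the same hash code, yet the dictionary still tells the keys apart through Equals:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the question's Thingy, with a deliberately bad hash.
public class CollidingThingy
{
    public string Foo;
    public string Bar;

    public override bool Equals(object obj)
    {
        CollidingThingy other = obj as CollidingThingy;
        return other != null && Foo == other.Foo && Bar == other.Bar;
    }

    // Every instance collides, so the dictionary must fall back on Equals.
    public override int GetHashCode()
    {
        return 1;
    }
}

public static class CollisionDemo
{
    public static int CountDistinctKeys()
    {
        var table = new Dictionary<CollidingThingy, int>();
        table[new CollidingThingy { Foo = "a", Bar = "b" }] = 1;
        table[new CollidingThingy { Foo = "a", Bar = "c" }] = 2;
        // Same Foo/Bar as the first entry, so this overwrites it instead of adding.
        table[new CollidingThingy { Foo = "a", Bar = "b" }] = 3;
        return table.Count; // 2 distinct keys
    }
}
```

With a constant hash every lookup degrades to a linear scan over the entries, which is where the performance cost comes from.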

Bob
Nice way to illustrate the point.
itowlson
+7  A: 

Hash collisions primarily affect performance, not correctness, so long as Equals() behaves correctly.

Dictionary uses the hash code as a way to organize items into separate "buckets". If too many items share the same hash code, you can run into performance problems. However, as long as Equals() can correctly distinguish between instances, you should get correct results.
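As a rough sketch of what the question proposes, here is a Dictionary<KeyedThingy, int> handing out sequential keys. KeyedThingy, ThingyKeyRegistry and GetOrAddKey are hypothetical names, and the hash is the question's concatenation approach, so ("ab", "c") and ("a", "bc") are guaranteed to collide because both hash the string "abc":

```csharp
using System;
using System.Collections.Generic;

// Hypothetical copy of the question's Thingy with a null-safe Equals.
public class KeyedThingy
{
    public string Foo;
    public string Bar;

    public override bool Equals(object obj)
    {
        KeyedThingy other = obj as KeyedThingy;
        return other != null && Foo == other.Foo && Bar == other.Bar;
    }

    // Same idea as the question: ("ab","c") and ("a","bc") hash identically.
    public override int GetHashCode()
    {
        return (Foo + Bar).GetHashCode();
    }
}

// Hypothetical helper that assigns each distinct Thingy the next integer key.
public class ThingyKeyRegistry
{
    private readonly Dictionary<KeyedThingy, int> keys = new Dictionary<KeyedThingy, int>();
    private int nextKey = 1;

    public int GetOrAddKey(KeyedThingy thing)
    {
        int key;
        if (!keys.TryGetValue(thing, out key))
        {
            // Not seen before (by Equals), so hand out a fresh sequential key.
            key = nextKey++;
            keys.Add(thing, key);
        }
        return key;
    }
}
```

Calling GetOrAddKey with ("ab", "c"), then ("a", "bc"), then ("ab", "c") again yields 1, 2 and 1: the colliding pair gets separate keys, and the repeated Thingy gets its original key back.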

Where hash codes can result in problems is with mutable objects. If your Thingy class allows Foo or Bar to change for an item in the dictionary, you may then fail to find it in a subsequent access attempt. This is because the hash code produced now differs from the one used to store the value in the dictionary.
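A small sketch of that failure, assuming a hypothetical MutableThingy class. Its hash is built from the field lengths purely so the outcome is predictable in this demo; it is not a recommended hash function:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical mutable key type; names are illustrative only.
public class MutableThingy
{
    public string Foo;
    public string Bar;

    public override bool Equals(object obj)
    {
        MutableThingy other = obj as MutableThingy;
        return other != null && Foo == other.Foo && Bar == other.Bar;
    }

    // Length-based hash, chosen only to make the demo deterministic.
    public override int GetHashCode()
    {
        return Foo.Length * 31 + Bar.Length;
    }
}

public static class MutationDemo
{
    // Returns true if the entry can still be found after its key is mutated.
    public static bool FoundAfterMutation()
    {
        var key = new MutableThingy { Foo = "a", Bar = "b" };
        var table = new Dictionary<MutableThingy, int>();
        table[key] = 1;

        key.Foo = "changed"; // hash was 32 when stored, is now 218

        // Looking up by the mutated key uses the new hash and misses.
        bool byMutatedKey = table.ContainsKey(key);
        // Looking up by the old values matches the stored hash, but Equals
        // now compares against the mutated key and fails.
        bool byOldValues = table.ContainsKey(new MutableThingy { Foo = "a", Bar = "b" });
        return byMutatedKey || byOldValues;
    }
}
```

The entry is still in the dictionary (Count stays at 1), but neither the mutated key nor a fresh key with the old values can reach it.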

LBushkin
This is actually true of any dictionary. All dictionary types assume constant keys.
Joel
For mutable objects, you generally want to leave the base object.Equals() method alone, as it tests reference equality. You usually want the == overload to test value equality. So if you leave the default object.Equals() alone, you can use mutable objects as dictionary keys without side effects.
Bob
Overriding operator == in non-immutable types is generally not recommended. The MSDN documentation actually discusses the cases where you may want to override `Object.Equals()` and the `==` operator. http://msdn.microsoft.com/en-us/library/ms173147%28VS.80%29.aspx
LBushkin
Didn't read that article before. Always good to know proper semantics and best practices. Thanks!
Bob
A: 

GetHashCode is designed for use in hash tables, where collisions need to be minimized but not eliminated. If you need to generate a truly unique key, GetHashCode is a reasonable starting point (and not as excessively long as a GUID), but you will need to store the key as part of the object and maintain a list of used keys separately.

While you may be able to retrieve something that looks usable from the internals of Dictionary, it probably won't work reliably - for example if you add more items than the dictionary was initially allocated to handle, the underlying data structure will get rebuilt and individual items could end up in a completely different part of the dictionary.

Tom Clarkson
Actually, what I meant about using the dictionary was that I would store the object as the key to the dict, and then store a new highest int as the value, and use that value as the key to my table. So values in the dict would be sequential, and if I looked up an object, I would get the unique numeric key for the table. So the internal dictionary structure is irrelevant.
Tesserex
So effectively you are using the dictionary to add an additional property to the object - hardly the most efficient method if you are working with a custom object you can control.
Tom Clarkson