views: 36
answers: 3
We currently make extensive use of the GetHashCode method, storing its hash codes in a database to track unique items. MSDN has a scary entry about this here:

"The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes."

We have been using this approach for several years without issue. Should we be worried, and if so, what would be a better approach?

To elaborate, the data comes from an external source. We take two to three string fields, concatenate them into a new string, and then call GetHashCode on the result.
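
Concretely, the approach described presumably looks something like the following sketch (field names are hypothetical):

    // Sketch of the approach described above; field names are hypothetical.
    // The concatenated natural key is reduced to a 32-bit hash code, which
    // is then stored as if it uniquely identified the item.
    static int ComputeTrackingId(string fieldA, string fieldB, string fieldC)
    {
        string key = fieldA + fieldB + fieldC;
        // Risky on two counts: distinct keys can share a hash code, and
        // plain concatenation conflates ("ab", "c") with ("a", "bc").
        return key.GetHashCode();
    }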

+2  A: 

Yes. Be scared. GetHashCode returns only 32 bits, so it cannot possibly guarantee the absence of collisions for any type with more than 2^32 distinct values. Given that some implementations of GetHashCode are less than perfect (i.e. some classes roll their own, poorly distributed versions), the risk may be even higher in practice. Regardless, this is a bad approach and needs a rethink.

I'd suggest a bit of reading on how hash tables work so that you better understand the purpose of a hash code: it's really only a heuristic for speedy storage and lookup, not an identity.
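
To make that concrete, here's a minimal sketch that brute-forces a GetHashCode collision between random strings; by the birthday bound on a 32-bit hash, one is expected after only about 80,000 inputs:

    using System;
    using System.Collections.Generic;

    class CollisionDemo
    {
        static void Main()
        {
            // Map each hash code to the first string that produced it;
            // stop as soon as a different string hits the same code.
            var seen = new Dictionary<int, string>();
            for (long count = 1; ; count++)
            {
                string candidate = Guid.NewGuid().ToString("N");
                int hash = candidate.GetHashCode();

                string earlier;
                if (seen.TryGetValue(hash, out earlier) && earlier != candidate)
                {
                    Console.WriteLine("Collision after {0} strings:", count);
                    Console.WriteLine("  \"{0}\" and \"{1}\" -> {2}",
                                      earlier, candidate, hash);
                    return;
                }
                seen[hash] = candidate;
            }
        }
    }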

spender
A: 

GetHashCode is not reliable as a unique identifier.

You have two choices in this regard:

  1. Override the GetHashCode method and have it return a Guid instead of an integer.
  2. Let your DB create unique id values for you.
Brian Driscoll
Misappropriating GetHashCode to return different values between calls is a terrible idea and will break much more than it fixes. Option 2 saves you from a -1.
spender
hmm... seems odd since MSDN recommends overriding GetHashCode to ensure that it returns unique values.
Brian Driscoll
@Brian - but you have to return the same unique value for the object every time. Generating a new Guid each time would violate that. Figuring out how to retrieve the same Guid each time is much more work than using a deterministic algorithm to construct a unique value for a complex object.
tvanfosson
See http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode (the combining pattern is sketched below).
tvanfosson
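
For reference, the accepted pattern in that question combines the fields' hash codes deterministically, roughly as in the sketch below (the class and field names are hypothetical). Note that even a good override yields a stable hash code, not a unique one:

    // Sketch of the deterministic hash-combining pattern from the linked
    // question. Class and field names are hypothetical.
    class ExternalRecord
    {
        public string FieldA { get; set; }
        public string FieldB { get; set; }
        public string FieldC { get; set; }

        public override bool Equals(object obj)
        {
            var other = obj as ExternalRecord;
            return other != null
                && FieldA == other.FieldA
                && FieldB == other.FieldB
                && FieldC == other.FieldC;
        }

        public override int GetHashCode()
        {
            unchecked // overflow is expected when combining hashes
            {
                int hash = 17;
                hash = hash * 23 + (FieldA == null ? 0 : FieldA.GetHashCode());
                hash = hash * 23 + (FieldB == null ? 0 : FieldB.GetHashCode());
                hash = hash * 23 + (FieldC == null ? 0 : FieldC.GetHashCode());
                return hash;
            }
        }
    }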
+2  A: 

Using a hash code as a unique identifier is a really bad idea because you're eventually guaranteed collisions once the collection is large enough -- and it doesn't have to be very large before a collision becomes statistically likely. Hash codes (assuming the same hash function) are a good, quick way to evaluate whether two objects might be the same: if they hash to different values, they are definitely different. If they hash to the same value, however, you still need an equality comparison to make sure they are the same object, i.e., you compare the properties that make the object unique, and only if those properties match are the objects the same.
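
In code, that two-step check looks roughly like this (a sketch using plain strings for brevity):

    // Sketch: a hash code can prove two values are different,
    // never that they are the same.
    static bool AreSame(string a, string b)
    {
        if (a.GetHashCode() != b.GetHashCode())
            return false;            // different hashes: definitely different
        return string.Equals(a, b);  // same hash is only a hint; confirm it
    }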

I'd suggest using a unique index in the database on the natural-key properties, in conjunction with an artificial auto-increment id as the primary key. Then you can be sure you don't get duplicate insertions in the DB (the uniqueness constraint of the index), and you can quickly compare objects outside the DB by simply checking whether they have the same id, which is likewise guaranteed unique by the primary-key constraint.
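
A sketch of that schema, with hypothetical table and column names, wrapped in a plain ADO.NET call:

    using System.Data.SqlClient;

    class SchemaSetup
    {
        // Surrogate identity primary key for cheap comparisons, plus a
        // unique constraint on the natural-key columns so the database
        // itself rejects duplicates. All names here are hypothetical.
        static void CreateItemsTable(string connectionString)
        {
            const string ddl = @"
                CREATE TABLE Items (
                    Id     INT IDENTITY(1,1) PRIMARY KEY,
                    FieldA NVARCHAR(400) NOT NULL,
                    FieldB NVARCHAR(400) NOT NULL,
                    FieldC NVARCHAR(400) NOT NULL,
                    CONSTRAINT UX_Items_NaturalKey
                        UNIQUE (FieldA, FieldB, FieldC)
                );";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(ddl, conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery(); // duplicates now fail on INSERT
            }
        }
    }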

tvanfosson