ansaurus

Question

Why should the "prime-based" hashcode implementation be used instead of the "naive" one?

Answer 1

+2 A:

The probability of collisions also depends on the expected distribution of the input data. In your example you assume input data that is uniformly distributed over the entire range. This is the ideal situation and it's no surprise that both algorithms perform well.

However, if you assume that the input data is generally similar in the high bits and differs mostly in the low bits (note: a lot of real data is like this), the prime number method will spread this variation out over the whole hash, whereas the XOR method will not: small changes in the low bits of two or more values can easily cancel each other out when XOR'ed. So the prime number method is less likely to collide in this case.
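To make the cancellation effect concrete, here is a minimal Python sketch. It is not the code from the question: the constants 17 and 31, the function names, and the sample base value are assumptions chosen for illustration, and the 32-bit mask only imitates a wrapping .NET Integer.

```python
from itertools import product

MASK32 = 0xFFFFFFFF  # keep results in 32 bits, imitating a wrapping .NET Integer


def xor_hash(a, b, c):
    """Naive combination: XOR the three values together."""
    return (a ^ b ^ c) & MASK32


def prime_hash(a, b, c):
    """Prime-based combination; 17 and 31 are assumed constants."""
    h = 17
    for v in (a, b, c):
        h = (h * 31 + v) & MASK32
    return h


# Sixteen values that share their high bits and differ only in the low 4 bits.
BASE = 0x12345600  # hypothetical "real data" prefix
values = [BASE + i for i in range(16)]
triples = list(product(values, repeat=3))

xor_distinct = len({xor_hash(*t) for t in triples})
prime_distinct = len({prime_hash(*t) for t in triples})
print(len(triples), xor_distinct, prime_distinct)
```

For these 4096 triples the XOR version produces only 16 distinct hashes, because the shared high bits pass through unchanged and the low bits are merely XOR'ed together, while the prime version gives every triple its own hash.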

Also you should use 32-bit values for GetHashCode, not 8-bit values.

Mark Byers 2010-03-15 07:16:00

It would take a little too long to test all the possible combinations of three 32-bit hashes.

Wilhelm 2010-03-15 07:26:57

@Wilhelm, instead of testing all possible values, it might be better to test on some real data using 32-bit hashes and see what results you get.

Mark Byers 2010-03-15 07:35:38

Answer 2

+1 A:

Truncating the hash is your problem here. The Xor method can only ever produce 256 distinct values. The Prime method can generate more than 750,000 distinct values, but you throw 749,744 of them away by using only the 8 low bits, so it can never do a better job than Xor.
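The effect of truncation can be seen on a small sample. The Python sketch below is an illustration under assumptions: the constants 17 and 31 and the 0 to 31 input range are not taken from the question, so the counts it prints will not match the 750,000 figure above. The point is only that keeping the low 8 bits caps the number of distinct hashes at 256, however many the full hash produces.

```python
from itertools import product


def prime_hash(a, b, c):
    """Prime-based combination; 17 and 31 are assumed constants."""
    h = 17
    for v in (a, b, c):
        h = (h * 31 + v) & 0xFFFFFFFF  # wrap to 32 bits
    return h


# A small sample of inputs so the loop finishes quickly.
triples = list(product(range(32), repeat=3))

full = {prime_hash(*t) for t in triples}   # distinct full-width hashes
truncated = {h & 0xFF for h in full}       # distinct hashes after keeping 8 bits

print(len(triples), len(full), len(truncated))
```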

In your specific case, you can do much better. There are enough bits in an Integer to generate a unique hash with 16 million distinct values:

  Public Shared Function GetGoodHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Integer
    ' Parenthesize each term: in VB.NET, + binds tighter than <<, and And binds looser than both.
    Return (valueOne And 255) Or ((valueTwo And 255) << 8) Or ((valueThree And 255) << 16)
  End Function
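The packing idea can be checked quickly in Python. This is a sketch, not a port of production code, and `pack_hash` is a hypothetical name. Python, like VB.NET, gives `+` higher precedence than `<<` and gives the bitwise and operator lower precedence than both, so each term is parenthesized explicitly.

```python
from itertools import product


def pack_hash(value_one, value_two, value_three):
    """Pack the low 8 bits of each value into separate byte lanes of one integer."""
    return ((value_one & 0xFF)
            | ((value_two & 0xFF) << 8)
            | ((value_three & 0xFF) << 16))


# Spot-check uniqueness on a subsample of byte values (every 17th value).
sample = range(0, 256, 17)
hashes = {pack_hash(a, b, c) for a, b, c in product(sample, repeat=3)}
print(len(hashes) == len(sample) ** 3)
print(hex(pack_hash(1, 2, 3)))
```

Because each input occupies its own byte, two different triples of byte-sized values can never produce the same result. Inputs outside 0 to 255 are reduced to their low 8 bits first, so uniqueness only holds for byte-sized inputs.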

The Xor method is okay when the input values are well distributed. A problem with the prime method is that it is easy to trigger an Overflow exception. That is difficult to deal with in VB.NET code because it has no equivalent of the C# unchecked keyword. You have to turn overflow checking off globally with Project + Properties, Compile tab, Advanced Compile Options, tick "Remove integer overflow checks". You can avoid that by computing the hash as an Int64, which makes it a bit more expensive.
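The "compute wide, then narrow" idea looks roughly like this in Python. Python integers never overflow, so the sketch only imitates the approach: the accumulator plays the role of the Int64, and `to_int32` (a hypothetical helper) does the final narrowing to a signed 32-bit value. The constants 17 and 31 are assumptions.

```python
INT32_MAX = 2**31 - 1


def to_int32(x):
    """Keep the low 32 bits of x and reinterpret them as a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


def prime_hash_wide(a, b, c):
    """Accumulate in a wide integer, then narrow once at the end."""
    h = 17
    for v in (a, b, c):
        h = h * 31 + v  # cannot overflow in the wide accumulator
    return to_int32(h)


# Worst-case inputs that would overflow a checked 32-bit computation.
result = prime_hash_wide(INT32_MAX, INT32_MAX, INT32_MAX)
print(result)
```

With three Int32 inputs and a multiplier of 31, the intermediate values stay far below the Int64 limit, which is why a single narrowing step at the end is enough.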

Hans Passant 2010-03-15 12:38:20

I only truncated to a byte for test purposes, as I wanted to know what the collision density was in both methods. The result is: exactly the same overall.

Wilhelm 2010-03-15 14:38:34

@Wilhelm: the problem is your test; it isn't realistic.

Hans Passant 2010-03-15 15:16:16

Why is it not realistic? If I were to do a full test, the results would be the same: the collision density in both methods will be equal for uniformly distributed variables.

Wilhelm 2010-03-16 18:42:37

@Wilhelm: review the remark I made about 256 values for xor, 750000 values for prime, 749,744 of which your test doesn't use.

Hans Passant 2010-03-16 20:17:30

If I am limiting a hash code to a byte value for test purposes, why should it take the other values? I am not limiting the entry values, I am limiting the hash code size.

Wilhelm 2010-03-16 21:23:23

Yes, down to 256 distinct possible values. It can never do a better job than xor. You can only get a lower collision rate if there are more possible hash values.

Hans Passant 2010-03-16 21:38:30
