views:

70

answers:

2

I have seen that a prime number implmentation of the GetHashCode function is being recommend, for example here. However using the following code (in VB, sorry), it seems as if that implementation gives the same hash density as a "naive" xor implementation. If the density is the same, I would suppose there is the same probability of cllision in both implementations. Am I missing anything on why is the prime approach preferred?

I am supossing that if the hash code is a byte I do not lose generality for the integer case.

Sub Main()
    Dim XorHashes(255) As Integer
    Dim PrimeHashes(255) As Integer

    For i = 0 To 255
        For j = 0 To 255
            For k = 0 To 255
                XorHashes(GetXorHash(i, j, k)) += 1
                PrimeHashes(GetPrimeHash(i, j, k)) += 1
            Next
        Next
    Next

    For i = 0 To 255
        Console.WriteLine("{0}: {1}, {2}", i, XorHashes(i), PrimeHashes(i))
    Next
    Console.ReadKey()
End Sub

Public Function GetXorHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
    Return CByte((valueOne Xor valueTwo Xor valueThree) Mod 256)
End Function

Public Function GetPrimeHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
    Dim TempHash = 17
    TempHash = 31 * TempHash + valueOne
    TempHash = 31 * TempHash + valueTwo
    TempHash = 31 * TempHash + valueThree

    Return CByte(TempHash Mod 256)
End Function
+2  A: 

The probability of collisions also depends on the expected distribution of the input data. In your example you assume input data that is uniformly distributed over the entire range. This is the ideal situation and it's no surprise that both algorithms perform well.

However, if you assume that the input data generally is similar in the high bits and differs mostly only in the low bits (note: a lot of real data is like this), the prime number method will spread this variation out over the whole hash whereas the XOR method will not - small changes in the low bits of two or more values can easily cancel each other out when XOR'ed. So the prime number method is less likely to collide in this case.

Also you should use 32-bit values for GetHashCode, not 8-bit values.

Mark Byers
A little too lenghty to test all the possible combinations of three 32-bit hashes.
Wilhelm
@Wilhelm, It might be better instead of testing all possible values to test on some real data using 32-bit hashes and see what results you get then.
Mark Byers
+1  A: 

Truncating the hash is your problem here. The Xor method can only ever produce 256 distinct values. The Prime method can generate more than 750,000 distinct values, but you throw 749,744 of them away by using only the 8 low bits. And can thus never do a better job than Xor.

In your specific case, you can do much better. There are enough bits in an Integer to generate a unique hash with 16 million distinct values:

  Public Shared Function GetGoodHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Integer
    Return valueOne And 255 + (valueTwo And 255) << 8 + (valueThree And 255) << 16
  End Function

The Xor method is okay when the input values are well distributed. A problem with the prime method is that it is easy to trigger an Overflow exception. That's difficult to deal with in VB.NET code, it doesn't have the equivalent of the C# unchecked keyword. You have to turn that off globally with Project + Properties, Compile tab, Advanced Compile Options, tick "Remove integer overflow checks". Avoid that by computing the hash as an Int64. Which makes it a bit expensive.

Hans Passant
I only truncated to byte for test purposes, as I wanted to know how collision desity was in both methods. The result is: exactly the same overall.
Wilhelm
@Wilhelm: the problem is your test, it isn't realistic.
Hans Passant
Why it is not realistic? If I were to do a full test the results will be the same, the collision density in both methods will be equal for uniform distributed variables.
Wilhelm
@Wilhelm: review the remark I made about 256 values for xor, 750000 values for prime, 749,744 of which your test doesn't use.
Hans Passant
If I am limiting a hash code to a byte value for test purposes, why should it take the other values? I am not limiting the entry values, I am limiting the hash code size.
Wilhelm
Yes, down to 256 distinct possible values. It can never do a better job than xor. You can only get a lower collision rate if there are more possible hash values.
Hans Passant