views:

1597

answers:

14

I'm testing the VB function below that I got from a Google search. I plan to use it to generate hash codes for quick string comparison. However, there are occasions in which two different strings have the same hash code. For example, these strings

"122Gen 1 heap size (.NET CLR Memory w3wp):mccsmtpteweb025.20833333333333E-02"

"122Gen 2 heap size (.NET CLR Memory w3wp):mccsmtpteweb015.20833333333333E-02"

have the same hash code of 237117279.

Please tell me: - What is wrong with the function? - How can I fix it?

Thank you

martin


Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" (dest As Any, src As Any, ByVal bytes As Long)

Private Function HashCode(Key As String) As Long
  On Error GoTo ErrorGoTo

  Dim lastEl As Long, i As Long
  ' copy ansi codes into an array of long'
  lastEl = (Len(Key) - 1) \ 4
  ReDim codes(lastEl) As Long
  ' this also converts from Unicode to ANSI'
  CopyMemory codes(0), ByVal Key, Len(Key)
  ' XOR the ANSI codes of all characters'

  For i = 0 To lastEl - 1
    HashCode = HashCode Xor codes(i) 'Xor'
  Next

ErrorGoTo:
  Exit Function
End Function
+4  A: 

Hash functions do not guarantee uniqueness of hash values. If the input value range (judging your sample strings) is larger than the output value range (eg 32 bit integer), then uniqueness is physically impossible.

ΤΖΩΤΖΙΟΥ
+7  A: 

The two Strings have the same characters. (Note the '2' and the '1' that are flip-flopped)

That is why the hash value is the same.

Make sure that the hash function is taking into account the order of the characters.

jjnguy
Martin
A: 

I don't quite see the environment you work in. Is this .Net code? If you really want good hash codes, I would recommend looking into cryptographic hashes (proven algorithms) instead of trying to write your own.

Btw, could you edit your post and paste the code in as a Code Sample (see toolbar)? This would make it easier to read.

Magnus Akselvoll
It's classic VB. .Net wouldn't allow the CopyMemory syntax.
Joel Coehoorn
A: 

"Don't do that."

Writing your own hash function is a big mistake, because your language certainly already has an implementation of SHA-1, which is a perfectly good hash function. If you only need 32 bits (instead of the 160 that SHA-1 provides), just use the last 32 bits of SHA-1.

He didn't write it, read the question
jjnguy
There is no problem in creating your own hash function, if it for something like string comparison, you might what something a little bit faster then one of the cryptographic hashes.
Mr Shark
+9  A: 

I'm betting there are more than just "occasions" when two strings generate the same hash using your function. In fact, it probably happens more often than you think.

A few things to realize:

First, there will be hash collisions. It happens. Even with really, really big spaces like MD5 there are still two strings that can generate the same resulting hash. You have to deal with those collisions by creating buckets.

Second, a long integer isn't really a big hash space. You're going to get more collisions that you would if you used more bits.

Thirdly, there are libraries available to you in Visual Basic (like .NET's System.Security.Cryptography namespace) that will do a much better job of hashing than most mere mortals.

clintp
Yes, you're right. I test this function for 200K strings and there are more than 4K collisions.
Martin
What is a bucket (not looking for a big answer... something I can Google or search SO for)
Anthony Mastrean
+1  A: 

No hash function can guarantee uniqueness. There are ~4 billion 32-bit integers, so even the best hash function will generate duplicates when presented with ~4 billion and 1 strings (and mostly likely long before).

Moving to 64-bit hashes or even 128-bit hashes isn't really the solution, though it reduces the probability of a collision.

If you want a better hash function you could look at the cryptographic hashes, but it would be better to reconsider you algorithm and decide if you can deal with the collisions some other way.

Rob Walker
+1  A: 

The System.Security.Cryptography namespace contains multiple classes which can do hashing for you (such as MD5) which will probably hash them better than you could yourself and will take much less effort.

You don't always have to reinvent the wheel.

Garry Shutler
Yes, it's not .NET. I'm trying the function in VBA for Excel.
Martin
Oh ok, you should edit the question to say it's VBA because it isn't obvious that it isn't .NET and most people will assume that it is.
Garry Shutler
+1  A: 

Simple XOR is a bad hash: you'll find lots of strings which collide. The hash doesn't depend on the order of the letters in the string, for one thing.

Try using the FNV hash http://isthe.com/chongo/tech/comp/fnv/

This is really simple to implement. It shifts the hash code after each XOR, so the same letters in a different order will produce a different hash.

Rich
A: 

There's a visual basic implementation of MD5 hashing here

http://www.bullzip.com/md5/vb/md5-visual-basic.htm

ballpointpeon
+1  A: 

I fixed the syntax highlighting for him.

Also, for those who weren't sure about the environment or were suggesting a more-secure hash: it's Classic (pre-.Net) VB, because .Net would require parentheses for the the call to CopyMemory.

IIRC, there aren't any secure hashes built in for Classic VB. There's not much out there on the web either, so this may be his best bet.

Joel Coehoorn
A: 

This particular hash functions XORs all of the characters in a string. Unfortunately XOR is associative:

(a XOR b) XOR c = a XOR (b XOR c)

So any strings with the same input characters will result in the same hash code. The two strings provided are the same, except for the location of two characters, therefore they should have the same hashcode.

You may need to find a better algorithm, MD5 would be a good choice.

A: 

The XOR operation is commutative; that is, when XORing all the chars in a string, the order of the chars does not matter. All anagrams of a string will produce the same XOR hash.

In your example, your second string can be generated from your first by swapping the "1" after "...Gen " with the first "2" following it.

There is nothing wrong with your function. All useful hashing functions will sometimes generate collisions, and your program must be prepared to resolve them.

A collision occurs when an input hashes to a value already identified with an earlier input. If a hashing algorithm could not generate collisions, the hash values would need to be as large as the input values. Such a hashing algorithm would be of limited use compared to just storing the input values.

-Al.

+2  A: 

If the biggest problem is that it doesn't account for the position of the bytes, you could fix it like this:

Private Function HashCode(Key As String) As Long
  On Error GoTo ErrorGoTo

  Dim lastEl As Long, i As Long
  ' copy ansi codes into an array of long'
  lastEl = (Len(Key) - 1) \ 4
  ReDim codes(lastEl) As Long
  ' this also converts from Unicode to ANSI'
  CopyMemory codes(0), ByVal Key, Len(Key)
  ' XOR the ANSI codes of all characters'

  For i = 0 To lastEl - 1
    HashCode = HashCode Xor (codes(i) + i) 'Xor'
  Next

ErrorGoTo:
  Exit Function
End Function

The only difference is that it adds the characters position to it's byte value before the XOR.

Joel Coehoorn
+1  A: 

Hash functions are not meant to return distinct values for distinct strings. However, a good hash function should return different values for strings that look alike. Hash functions are used to search for many reasons, including searching into a large collection. If the hash function is good and if it returns values from the range [0,N-1], then a large collection of M objects will be divide in N collections, each one having about M/N elements. This way, you need to search only in an array of M/N elements instead of searching in an array of M elements.

But, if you only have 2 strings, it is not faster to compute the hash value for those! It is better to just compare the two strings.

An interresing hash function could be:



    unsigned int hash(const char* name) {
      unsigned mul=1;
      unsigned val=0;
      while(name[0]!=0) {
        val+=mul*((unsigned)name[0]);
        mul*=7; //you could use an arbitrary prime number, but test the hash dispersion afterwards
        name++;
      }
      return val;
    }

botismarius