tags:

views:

182

answers:

4

I have the following structures defined (names are anonymised, but data types are correct):

Public Type ExampleDataItem
    Limit As Integer    ' could be any value 0-999
    Status As Integer   ' could be any value 0-2
    ValidUntil As Date  ' always a valid date
End Type

Public Type ExampleData
    Name As String      ' could be 5-20 chars long
    ValidOn As Date     ' could be valid date or 1899-12-30 representing "null"
    Salt As Integer     ' random value 42-32767
    Items(0 To 13) As ExampleDataItem
End Type

I would like to generate a 32-bit hash code for an ExampleData instance. Minimising hash collisions is important, performance and data order is not important.

So far I have got (in pseudocode):

  1. Serialise all members into one byte array.
  2. Loop through the byte array, reading 4 bytes at a time into a Long value.
  3. XOR all the Long values together.

I can't really post my code because it's heavily dependent on utility classes to do the serialisation, but if anyone wants to see it regardless then I will post it.

Will this be OK, or can anyone suggest a better way of doing it?

EDIT:

This code is being used to implement part of a software licensing system. The purpose of the hash is to confirm whether the data entered by the end user equals the data entered by the tech support person. The hash must therefore:

  1. Be very short. That's why I thought 32 bits would be most suitable, because it can be rendered as a 10-digit decimal number on screen. This is easy, quick and unambiguous to read over the telephone and type in.
  2. Be derived from all the fields in the data structure, with no extra artificial keys or any other trickery.

The hash is not required for lookup, uniqueness testing, or to store ExampleData instances in any kind of collection, but only for the one purpose described above.

A: 

You may be overthinking it, or I'm not understanding the issue. You could essentially just hash(CStr(Salt) + Name + CStr(ValidOn) + Anyotherstrings).

There is no particular need to go through the process of serializing into byte array and XORing values. Infact XORing values together in that way is more likely to create hash collisions where you arent intending them.

Edit: I think I understand now. You're creating your own hash value by XORing the data together? It's unfortunately quite likely to give collisions. I know VB6 doesnt include any hashing algorithms, so you may be best importing and using something like Phil Fresle's SHA256 implementation.

PaulG
All the data in the structures must contribute to the hash in some way. I don't care how I accomplish this as long as collisions are minimised. The byte array and XOR is simply my first stab at a solution.
Christian Hayter
A: 

EDIT: the question has now been edited to clarify that the goal is detecting typing errors, not minimizing collisions between totally different values. In that case Dan F's answer is the best one IMHO, not my offering below (wonderful though it is).


You could use the Microsoft CryptoAPI rather than rolling your own hash algorithm.

  • For instance this Microsoft article on using CryptoAPI from VB6 should get you started.
  • Or this from Edanmo on mvps.org for hashing a string in VB6.

EDIT: Following comment. If you insist on a 32-bit value, it will be hard to minimize hash collisions. My algorithm book suggests using Horner's method as a decent general purpose hashing algorithm. I don't have time right now to find out more information and implement in VB6. CopyMemory would probably be useful :)

MarkJ
Good suggestion, but I really do only want a 32-bit value. Sorry!
Christian Hayter
A: 

Considering that performance is not an objective, if file size is not important and you want a unique value for each item. Just add an ID field. It data type is a string. Then use this function to generate a GUID. This will be a unique ID. Use it as a key for a dictonary or collection.

Public Type GUID
    Data1 As Long
    Data2 As Integer
    Data3 As Integer
    Data4(7) As Byte
End Type

Public Type GUID2              '15 BYTES TOTAL
    Data1(14) As Byte
End Type

Public Declare Function CoCreateGuid Lib "OLE32.DLL" (pGuid As GUID) As Long

Public Function GetGUID() As String
    Dim VBRIG_PROC_ID_STRING As String
    VBRIG_PROC_ID_STRING = "GetGUID()"

    Dim lResult As Long
    Dim lguid As GUID
    Dim MyguidString As String
    Dim MyGuidString1 As String
    Dim MyGuidString2 As String
    Dim MyGuidString3 As String
    Dim DataLen As Integer
    Dim StringLen As Integer
    Dim i As Integer
    On Error GoTo error_olemsg
    lResult = CoCreateGuid(lguid)
    If lResult = 0 Then
        MyGuidString1 = Hex$(lguid.Data1)
        StringLen = Len(MyGuidString1)
        DataLen = Len(lguid.Data1)
        MyGuidString1 = LeadingZeros(2 * DataLen, StringLen) & MyGuidString1
        'First 4 bytes (8 hex digits)
        MyGuidString2 = Hex$(lguid.Data2)
        StringLen = Len(MyGuidString2)
        DataLen = Len(lguid.Data2)
        MyGuidString2 = LeadingZeros(2 * DataLen, StringLen) & Trim$(MyGuidString2)
        'Next 2 bytes (4 hex digits)
        MyGuidString3 = Hex$(lguid.Data3)
        StringLen = Len(MyGuidString3)
        DataLen = Len(lguid.Data3)
        MyGuidString3 = LeadingZeros(2 * DataLen, StringLen) & Trim$(MyGuidString3)
        'Next 2 bytes (4 hex digits)
        GetGUID = MyGuidString1 & MyGuidString2 & MyGuidString3
        For i = 0 To 7
            MyguidString = MyguidString & Format$(Hex$(lguid.Data4(i)), "00")
        Next i
        'MyGuidString contains last 8 bytes of Guid (16 hex digits)
        GetGUID = GetGUID & MyguidString
    Else
        GetGUID = "00000000" ' return zeros if function unsuccessful
    End If
    Exit Function
error_olemsg:
    GetGUID = "00000000"
    Exit Function
End Function

Public Function LeadingZeros(ExpectedLen As Integer, ActualLen As Integer) As String
    LeadingZeros = String$(ExpectedLen - ActualLen, "0")
End Function
RS Conley
I think you have misunderstood my requirements - not surprising because I didn't give much detail. I have added more info to the question which should clarify.
Christian Hayter
+3  A: 

Can you use the CRC32? Steve McMahon has an implementation. Combine that with a bit of base32 encoding and you've got something short enough to read over the phone.

Dan F
CRC32 looks like the most practical solution, thanks Dan.
Christian Hayter
+1 Steve McMahon's code is always excellent, and now the question has been clarified CRC is clearly a good choice. You have accidentally linked to the Vb.NET code - I have edited your answer to link to Steve McMahon's VB6 version instead.
MarkJ
Thanks Mark. It wasn't difficult to find the right code anyway. :-)
Christian Hayter
Ah, whoops, thanks Mark. It all blurs into one some days :-)
Dan F