views:

721

answers:

4

Hi, I want to hash a char array into an int or a long. The resulting value has to adhere to a given precision value. The function I've been using is given below:

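#include <cmath>

int GetHash(const char* zKey, int iPrecision /*= 6*/)
{
        // ELF hash, FROM : http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp

        unsigned long h = 0;
        // Modulus 10^iPrecision limits the result to iPrecision decimal digits.
        // lround guards against pow returning a value just below the exact power.
        long M = std::lround(std::pow(10.0, iPrecision));

        while(*zKey)
        {
                // Shift the running hash left by 4 bits and add the next character.
                h = (h << 4) + *zKey++;
                // Fold bits 28-31 back into the lower bits, then clear them.
                unsigned long g = h & 0xF0000000L;
                if (g) h ^= g >> 24;
                h &= ~g;
        }

        return (int) (h % M);
}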

The string to be hashed is similar to "SAEUI1210.00000010_1".

However, this produces duplicate values in some cases. Are there any good alternatives that wouldn't produce the same hash for different string values?

+2  A: 

Every hash will have collisions. Period. There are more possible strings than hash values, and the Birthday Problem tells you how quickly the collisions show up.

You may want to check cryptographic hash functions like MD5 (relatively fast, and you don't care that it's insecure), but it will also have collisions.

sharptooth
Perfect hashes by definition don't.
MSalters
+6  A: 

The very definition of a hash is that it produces duplicate values for some inputs, because the hash value range is smaller than the space of the hashed data.

In theory, a 32-bit hash has enough range to hash all strings of up to 5 characters (A-Z, a-z, 0-9 only) without causing a collision, since 62^5 is about 9.2 * 10^8 while 2^32 is about 4.3 * 10^9. In practice, hashes are not a perfect permutation of the input. Given a 32-bit hash, you can expect to get hash collisions after hashing about 2^16 (roughly 65,000) random inputs, due to the birthday paradox.
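To put numbers on the birthday estimate, here is a minimal sketch using the standard approximation p = 1 - exp(-n(n-1) / (2M)) for n uniformly random values drawn from M possibilities. The function names are made up for illustration, and a real hash is only approximately uniform, so treat the results as rough guidance.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability that at least two of `n` uniformly random hash
// values coincide when there are `buckets` possible values.
// Uses the birthday approximation 1 - exp(-n(n-1) / (2 * buckets)).
double CollisionProbability(double n, double buckets)
{
    assert(buckets > 0.0 && n >= 0.0);
    return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * buckets));
}

// Approximate number of random inputs at which a collision becomes more
// likely than not: sqrt(2 * ln(2) * buckets).
double InputsForEvenOdds(double buckets)
{
    assert(buckets > 0.0);
    return std::sqrt(2.0 * std::log(2.0) * buckets);
}
```

Under this approximation, InputsForEvenOdds(4294967296.0) comes out at roughly 77,000 for a full 32-bit range, while InputsForEvenOdds(1e6), which corresponds to iPrecision = 6 in the question, comes out at roughly 1,200.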

Given a static set of data values, it's always possible to construct a hash function designed specifically for them that never maps two of them to the same value (of course, its output will need at least log2(|data set|) bits). However, it requires you to know all the possible data values ahead of time. This is called perfect hashing.
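One naive way to illustrate the idea for a small, known key set is to try seeds for a seeded hash until one maps every key to a distinct slot. This is only a brute-force sketch, not how dedicated perfect-hash generators work; SeededHash (an FNV-1a-style hash with the seed XORed into the starting value) and FindPerfectSeed are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a-style string hash. Mixing the seed into the starting value is an
// assumption made for this sketch, not part of standard FNV-1a.
std::uint32_t SeededHash(const std::string& key, std::uint32_t seed)
{
    std::uint32_t h = 2166136261u ^ seed;
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Tries seeds 0 .. maxSeeds-1 and returns the first one for which every key
// lands in a distinct slot of [0, tableSize). Returns -1 if none is found.
// The keys themselves must be distinct.
long long FindPerfectSeed(const std::vector<std::string>& keys,
                          std::uint32_t tableSize,
                          std::uint32_t maxSeeds)
{
    assert(tableSize > 0 && tableSize >= keys.size());
    for (std::uint32_t seed = 0; seed < maxSeeds; ++seed) {
        std::vector<bool> used(tableSize, false);
        bool ok = true;
        for (const std::string& key : keys) {
            std::uint32_t slot = SeededHash(key, seed) % tableSize;
            if (used[slot]) { ok = false; break; }
            used[slot] = true;
        }
        if (ok) return seed;
    }
    return -1;
}
```

With tableSize set to 1000000 this would match the iPrecision = 6 constraint from the question. The search gets slow as the number of keys approaches the table size, and a power-of-two tableSize is a poor choice here because the low bits of this hash depend only on the low bits of the seed.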

That being said, here are a few alternatives which should get you started (they are designed to minimize collisions)

ASk
Which is the best hashing function to use out of the ones given in the link you've provided and the one that I'm using right now? The function that I'm using seems to be more complex than djb2 and sdbm. Does that mean it's better at avoiding collisions?
Gayan
The only way to test which hash function is "best" for your purposes is to run a benchmark on a data sample that fits your expected real data. The function that you are using does not mix the input bits together very thoroughly: at each step, at most the 4 topmost bits are folded back in, and in strings shorter than 7 characters even that does not happen; your hash simply accumulates all characters, with a slight bit overlap.
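A benchmark of that kind could be sketched as below. ElfHash is the function from the question moved to a fixed 32-bit type, while Djb2 and Sdbm follow the commonly published forms of those hashes. HashFn, CountCollisions and MakeSampleKeys are hypothetical helpers, and the generated keys only imitate the shape of "SAEUI1210.00000010_1"; real data should be used where available.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

using HashFn = std::function<std::uint32_t(const std::string&)>;

// ELF hash as in the question (characters are read as unsigned here).
std::uint32_t ElfHash(const std::string& key)
{
    std::uint32_t h = 0;
    for (unsigned char c : key) {
        h = (h << 4) + c;
        std::uint32_t g = h & 0xF0000000u;
        if (g) h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

// djb2: h = h * 33 + c, starting from 5381.
std::uint32_t Djb2(const std::string& key)
{
    std::uint32_t h = 5381;
    for (unsigned char c : key) h = h * 33u + c;
    return h;
}

// sdbm: h = c + (h << 6) + (h << 16) - h, starting from 0.
std::uint32_t Sdbm(const std::string& key)
{
    std::uint32_t h = 0;
    for (unsigned char c : key) h = c + (h << 6) + (h << 16) - h;
    return h;
}

// Counts the keys whose hash, reduced modulo `modulus`, equals a value that
// an earlier key already produced. The keys are expected to be distinct.
std::size_t CountCollisions(const HashFn& fn,
                            const std::vector<std::string>& keys,
                            std::uint32_t modulus)
{
    assert(modulus > 0);
    std::unordered_set<std::uint32_t> seen;
    std::size_t collisions = 0;
    for (const std::string& key : keys) {
        if (!seen.insert(fn(key) % modulus).second) ++collisions;
    }
    return collisions;
}

// Hypothetical sample data: "SAEUI1210." + 8-digit counter + "_1".
std::vector<std::string> MakeSampleKeys(int count)
{
    assert(count >= 0 && count <= 100000000);
    std::vector<std::string> keys;
    for (int i = 0; i < count; ++i) {
        std::string digits = std::to_string(i);
        keys.push_back("SAEUI1210." + std::string(8 - digits.size(), '0') + digits + "_1");
    }
    return keys;
}
```

Calling CountCollisions(ElfHash, keys, 1000000), and the same for Djb2 and Sdbm, on one key set gives directly comparable counts for the 6-digit case. Which function comes out ahead depends on the data, so no result is claimed here.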
ASk
+2  A: 

Hashes generate the same value for different inputs -- that's what they do. All you can do is create a hash function with sufficient distribution or bit depth (or both) to minimize those collisions. Since you have the additional constraint of precision (the default of 6 digits leaves only 10^6 possible values), you are going to hit collisions far more often.

Talljoe
A: 

MD5 or SHA. There are many open implementations, and the outcome is very unlikely to produce a duplicate result.

Adam Matan
Yes. But my requirement also includes the fact that the result has to be an integer. MD5 hashes contain both ints and chars. I think it's the same for SHA algorithms.
Gayan
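True, but the conversion is trivial: from a 128-bit digest to a 32-bit integer. You'll get two lines of code (hash, int conversion). The full 128-bit digest is de facto collision-free; once truncated to 32 bits it will collide about as often as any other well-distributed 32-bit hash.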
Adam Matan