tags:

views:

531

answers:

7

I'm writing a program right now which produces four unsigned 32-bit integers as output from a certain function. I'm wanting to hash these four integers, so I can compare the output of this function to future outputs.

I'm having trouble writing a decent hashing function though. When I originally wrote this code, I threw in a simple addition of each of the four integers, which I knew would not suffice. I've tried several other techniques, such as shifting and adding, to no avail. I get a hash, but it's of poor quality, and the function generate a ton of collisions.

The hash output can be either a 32-bit or 64-bit integer. The function in question generates many billions of hashes, so collisions are a real problem here, and I'm willing to use a larger variable to ensure that there are as few collisions as possible.

Can anyone help me figure out how to write a quality hash function?

+6  A: 

Why don't you store the four integers in a suitable data structure and compare them all? The benefit of hashing them in this case appears dubious to me, unless storage is a problem.

If storage is the issue, you can use one of the hash functions analyzed here.

Vinko Vrsalovic
+4  A: 

Because hashing can generate collisions, you have to keep the keys in memory anyway in order to discover these collisions. Hashmaps and other standard datastructures do do this in their internal bookkeeping.

As the key is so small, just use the key directly rather than hashing. This will be faster and will ensure no collisions.

Will
A: 

Why a hash? It seems like a std::set or std::multi set would be better suited to store this kind of output. All you'd need to do is wrap the four integers up in a struct and write a simple compare function.

Graphics Noob
A: 

Try using CRC or FNV. FNV is nice because it is fast and has a defined method of folding bits to get "smaller" hash values (i.e. 12-bit / 24-bit / etc).

Also the benefit of generating a 64-bit hash from a 128-bit (4 X 32-bit) number is a bit questionable because as other people have suggested, you could just use the original value as a key in a set. You really want the number of bits in the hash to represent the number of values you originally have. For example, if your dataset has 100,000 4X32-bit values, you probably want a 17-bit or 18-bit hash value, not a 64-bit hash.

Adisak
Hmmm... note sure why the downvote ?
Adisak
A: 

Might be a bit overkill, but consider Boost.Hash. Generates very simple code and good values.

larsm
A: 

I fully agree with Vinko - just compare them all. If you still want a good hashing function, you need to analyse the distribution of your 4 unsinged integers. Then you have to craft your hashing function in a way, that the result will be even distributed over the whole range of the 32 bit hashing value.

A simple example - let's just assume that most of the time, the result from each function is in the range from 0 to 255. Then you could easily blend the lower 8 bits from each function into your hash. Most of the time, you'd finde the result directly, just sometimes (when one function returns a larger result) you'd have a collision.

To sum it up - without information how the results of the 4 functions are distributed, we can't help you with a good hashing function.

Tobias Langner
A: 

Here's a fairly reasonable hash function from 4 integers to 1 integer:

unsigned int hash = in[0];
hash *= 37;
hash += in[1];
hash *= 37;
hash += in[2];
hash *= 37;
hash += in[3];

With uniformly-distributed input it gives uniformly-distributed output. All bits of the input participate in the output, and every input value (although not every input bit) can affect every output bit. Chances are it's faster than the function which produces the output, in which case no performance concerns.

There are other hashes with other characteristics, but accumulate-with-multiplication-by-prime is a good start until proven otherwise. You could try accumulating with xor instead of addition if you like. Either way, it's easy to generate collisions (for example {1, 0, a, b} collides with {0, 37, a, b} for all a, b), so you might want to pick a prime which you think has nothing to do with any plausible implementation bug in your function. So if your function has a lot of modulo-37 arithmetic in it, maybe use 1000003 instead.

Steve Jessop