I need to generate a commutative hash based on three sets of "score" structs.

Each score has a "start", an "end" and a "number".

Both start and end are usually huge numbers (8-9 digits) but number is just from 1 to 4.

I need them to be commutative so the order does not matter. I'm using XOR at the moment but it seems to be giving bad results.

Since I'm working with large datasets, I'd prefer a performance-friendly solution. Any suggestions? Thanks =]

public static int getCustomHash(cnvRegion c1, cnvRegion c2, cnvRegion c3)
{
    int part1 = (c1.startLocation * c2.startLocation * c3.startLocation);
    int part2 = (c1.endLocation * c2.endLocation * c3.endLocation);
    int part3 = (c1.copyNumber + c2.copyNumber + c3.copyNumber) * 23735160;
    return part1 ^ part2 ^ part3;
}
A: 

Thomas Wang has a discussion of hash functions here.

  • See Knuth's method and the 64-bit to 32-bit mix functions.

Paul Hsieh also has a page on integer hashing, which describes his "SuperFastHash" function which got mixed feedback.

EDIT

Because you want your custom hash to be commutative (I assume between the cnvRegion params), you could probably write something like this:

public int hash6432shift(long key)
{
   key = (~key) + (key << 18); // key = (key << 18) - key - 1;
   key = key ^ (key >>> 31);
   key = key * 21; // key = (key + (key << 2)) + (key << 4);
   key = key ^ (key >>> 11);
   key = key + (key << 6);
   key = key ^ (key >>> 22);
   return (int) key;
}

public static int getCustomHash(cnvRegion c1, cnvRegion c2, cnvRegion c3)
{
    int part1 = (c1.startLocation ^ c2.startLocation ^ c3.startLocation);
    int part2 = (c1.endLocation ^ c2.endLocation ^ c3.endLocation);
    int part3 = (c1.copyNumber ^ c2.copyNumber ^ c3.copyNumber);

    int hash1 = hash6432shift(((long)part1 << 0x20) | (part2 & 0xFFFFFFFFL)); // mask so a negative int doesn't sign-extend into the high bits
    return hash6432shift(((long)hash1 << 0x20) | (part3 & 0xFFFFFFFFL));
}
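
To make the idea concrete, here is a minimal, self-contained Java sketch of the approach above (the `CnvRegion` class and the test values are made up for illustration; the original question's type is C#'s `cnvRegion`). It checks that the combined hash really is invariant under permutation of the three regions:

```java
// Hypothetical stand-in for the cnvRegion type from the question.
final class CnvRegion {
    final int startLocation, endLocation, copyNumber;
    CnvRegion(int s, int e, int n) { startLocation = s; endLocation = e; copyNumber = n; }
}

public class CommutativeHashDemo {
    // Thomas Wang's 64-bit to 32-bit mix function.
    static int hash6432shift(long key) {
        key = (~key) + (key << 18);
        key ^= key >>> 31;
        key *= 21;
        key ^= key >>> 11;
        key += key << 6;
        key ^= key >>> 22;
        return (int) key;
    }

    static int getCustomHash(CnvRegion c1, CnvRegion c2, CnvRegion c3) {
        // XOR is commutative, so the order of c1, c2, c3 doesn't matter.
        int part1 = c1.startLocation ^ c2.startLocation ^ c3.startLocation;
        int part2 = c1.endLocation ^ c2.endLocation ^ c3.endLocation;
        int part3 = c1.copyNumber ^ c2.copyNumber ^ c3.copyNumber;
        // Mask with 0xFFFFFFFFL so a negative int doesn't sign-extend
        // into the upper half of the long before mixing.
        int hash1 = hash6432shift(((long) part1 << 32) | (part2 & 0xFFFFFFFFL));
        return hash6432shift(((long) hash1 << 32) | (part3 & 0xFFFFFFFFL));
    }

    public static void main(String[] args) {
        CnvRegion a = new CnvRegion(123456789, 123460000, 2);
        CnvRegion b = new CnvRegion(987654321, 987660000, 4);
        CnvRegion c = new CnvRegion(555555555, 555560000, 1);
        int h = getCustomHash(a, b, c);
        // All orderings of the three regions produce the same value.
        System.out.println(h == getCustomHash(b, c, a)); // prints "true"
        System.out.println(h == getCustomHash(c, b, a)); // prints "true"
    }
}
```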

However, in the end, the task of finding a hash function that is both fast and provides good collision resistance depends heavily on the data you are processing.

Let me give you an example:

Let's say that the values you are hashing are large, 10-digit numbers representing UNIX timestamps (the time elapsed in seconds since 01/01/1970). In that case, hashing a lot of timestamps that occur within a limited time span, say over a month, is simply a matter of eliminating the portion that doesn't change and using only the portion of the timestamp that changes a lot. This is the same as saying that you are eliminating the portions that have low entropy.

v1 = 1241536920   // 5/5/2009 3:22:00 PM
v2 = 1241529720   // 5/5/2009 1:22:00 PM
v3 = 1241270520   // 5/2/2009 1:22:00 PM
v4 = 1242825720   // 5/20/2009 1:22:00 PM

It is pretty clear that we could safely eliminate the first 3-4 digits and use only the remaining digits as the hash. Also, if these values usually occur within a few minutes of each other, you can drop the last 2-3 digits as well.

In this manner, you're left with only 4 digits that you can use as a hash, with pretty good collision resistance for our example.

My point is that hash functions can be highly optimized if you know the statistical distribution of the values you are trying to hash.
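
As a quick sketch of the digit-dropping idea (the class name and the choice of which digits to keep are illustrative, not part of the answer), keeping digits 3-6 from the right of the timestamps above yields distinct small hashes:

```java
public class TimestampHashDemo {
    // Keep only the middle digits of a UNIX timestamp: drop the leading
    // digits (they barely change over a short window, i.e. low entropy)
    // and the trailing digits (identical when events cluster in time).
    static int midDigitsHash(long timestamp) {
        return (int) ((timestamp / 100) % 10000); // digits 3-6 from the right
    }

    public static void main(String[] args) {
        System.out.println(midDigitsHash(1241536920L)); // 5/5/2009 3:22:00 PM -> 5369
        System.out.println(midDigitsHash(1241529720L)); // 5/5/2009 1:22:00 PM -> 5297
        System.out.println(midDigitsHash(1241270520L)); // 5/2/2009 1:22:00 PM -> 2705
        System.out.println(midDigitsHash(1242825720L)); // 5/20/2009 1:22:00 PM -> 8257
    }
}
```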

Miky Dinescu
I'm having a hard time understanding the crypto language, and the information doesn't seem to apply to my case (getting a COMMUTATIVE hash from THREE sources).
DarkAmgine
So I was looking at the hashes that my original code generated and was surprised to see them fairly evenly distributed from -2147483648 to +2147483647. Turns out I'm just getting a lot of collisions because there's too much data for an int. Thanks
DarkAmgine
Well, in that case you'll probably have to try Int64.. :)
Miky Dinescu
by the way, do you know what ">>>" means? c# doesn't seem to like it...
DarkAmgine
Never mind, I used >> and it worked.
DarkAmgine
In Java, ">>>" is an unsigned right shift, and it is the same thing as doing a right shift on a UInt32 in C#. (Java doesn't have unsigned types.)
Miky Dinescu
A: 

First, I think the requirements are not quite clear. Suppose you hash three datasets c1, c2 and c3, then switch c1.copyNumber and c2.copyNumber and hash again. Should that give the same result or not? And if you switch c1.startLocation with c1.endLocation, should that result in the same hash or not?

I'm going to assume that you'd like to have different hash results in both cases, and that the only permutations that should not change the hash result are permutations of the datasets c1, c2, c3.

If that is the case, then I'd propose to first hash the three datasets independently to smaller values, i.e. h1 = H(c1), h2 = H(c2), h3 = H(c3), where H can be any hash function (e.g., CRC32, Adler-32, SHA-1, etc.) depending on how hard you want to avoid collisions.

The next step would be to compute a commutative hash of h1, h2, h3. If you want to avoid collisions unless h1, h2, h3 are permuted, then the following works. Compute the polynomial

  • P(x) = (x-h1)(x-h2)(x-h3)

then hash the polynomial (resp. its coefficients) with any good hash function. I.e., that would be

  • H(h1+h2+h3 || h1 * h2 + h1 * h3 + h2 * h3 || h1 * h2 * h3), where || is concatenation.

If you want to avoid any unnecessary collision at all costs, then the coefficients should be computed as multiprecision integers and a collision-resistant hash function such as SHA-1 should be used. Because of the unique factorisation property of polynomials, it follows that the coefficients of the polynomial are different if h1, h2 and h3 are different. But it seems that avoiding collisions at all costs is overkill in your case.
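
A minimal Java sketch of the symbolic variant (the class name is made up, `Arrays.hashCode` stands in for "any good hash function" H, and plain `long` arithmetic that wraps on overflow replaces the multiprecision integers, so this version does not carry the collision-free guarantee):

```java
import java.util.Arrays;

public class SymmetricHashDemo {
    // The elementary symmetric polynomials in h1, h2, h3 are the
    // coefficients of P(x) = (x-h1)(x-h2)(x-h3); they are invariant
    // under any permutation of the inputs.
    static long[] coefficients(long h1, long h2, long h3) {
        long e1 = h1 + h2 + h3;
        long e2 = h1 * h2 + h1 * h3 + h2 * h3;
        long e3 = h1 * h2 * h3;
        return new long[] { e1, e2, e3 };
    }

    static int hash(long h1, long h2, long h3) {
        // Stand-in for H(e1 || e2 || e3); any good hash of the
        // coefficient sequence would do here.
        return Arrays.hashCode(coefficients(h1, h2, h3));
    }

    public static void main(String[] args) {
        System.out.println(hash(17, 42, 99) == hash(99, 17, 42)); // prints "true"
    }
}
```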

So rather than computing the polynomial P(x) symbolically, one could just evaluate it at an arbitrary value R. I.e., if h1, h2, h3 are just 32-bit values, then computing the following might be enough (some C-style pseudocode follows; I don't know what C# uses for 64-bit integers):

const long long R = SOME_RANDOM_64_BIT_CONSTANT;
long long hash0 = (R - h1) * (R - h2) * (R - h3);
int hash = (int) (hash0 >> 32);

I'm using 64-bit multiplications here because they are fast enough on modern CPUs, and I'm using the upper 32 bits of hash0 rather than the lower 32 bits because the lower 32 bits are somewhat biased. E.g., the least significant bit is much more likely to be 0 than 1.
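
The pseudocode above translates directly to Java. A small runnable sketch (the class name and the particular constant R are illustrative; any fixed 64-bit constant works), showing the result is unchanged by reordering h1, h2, h3 since multiplication is commutative:

```java
public class EvaluatedPolyHashDemo {
    // Arbitrary fixed 64-bit constant; any random value will do.
    static final long R = 0x9E3779B97F4A7C15L;

    static int hash(int h1, int h2, int h3) {
        // Evaluate P(R) = (R-h1)(R-h2)(R-h3); overflow wraps mod 2^64.
        long hash0 = (R - h1) * (R - h2) * (R - h3);
        // Keep the upper 32 bits; the low bits of a product are biased
        // (e.g. the LSB is 1 only if all three factors are odd).
        return (int) (hash0 >>> 32);
    }

    public static void main(String[] args) {
        System.out.println(hash(17, 42, 99) == hash(42, 99, 17)); // prints "true"
    }
}
```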

Accipitridae