Are there any known hash algorithms that take a vector of ints as input and output a single int, working similarly to an inner product?

In other words, I am thinking about a hash algorithm that might look like this in C++:

// For simplicity, I'm not worrying about overflow, and assuming |v| < 7.
int HashVector(const vector<int>& v) {
  const int N = kSomethingBig;
  const int w[] = {234, 739, 934, 23, 828, 194};  // Carefully chosen constants.
  int result = 0;
  for (size_t i = 0; i < v.size(); ++i) result = (result + w[i] * v[i]) % N;
  return result;
}

I'm interested in this because I'm writing up a paper on an algorithm that would benefit from any previous work on similar hashes. In particular, it would be great if there is anything known about the collision properties of a hash algorithm like this.

The algorithm I'm interested in would hash integer vectors, but something for float vectors would also be cool.

Clarification

The hash is intended for use in a hash table for fast key/value lookups. There is no security concern here.

The desired answer is something like a set of constants that provably work particularly well for a hash like this - analogous to a multiplier and modulus that work better than others in a pseudorandom number generator.

For example, some choices of constants for a linear congruential pseudorandom generator are known to give optimal cycle lengths and use a modulus that is easy to compute. Maybe someone has done research to show that a certain set of multiplicative constants, along with a modulus, in a vector hash can reduce the chance of collisions amongst nearby integer vectors.

+1  A: 

Depending on the size of the constants, I'd have to say the degree of chaos in the input vector will have an impact on the result. However, a quick qualitative analysis of your post would suggest that you have a good start:

  • Your inputs are multiplied, therefore increasing the degree of separation between similar input values per iteration (for instance, 65 + 66 is much smaller than 65 * 66), which is good.
  • It's deterministic and order-sensitive, which is fine unless your vector should be treated as a set rather than a sequence. For clarity, should v = { 23, 30, 37 } hash differently from v = { 30, 23, 37 }?
  • The uniformity of the distribution will vary with the range and chaos of the input values in v. However, that's true of a generalized integer hashing algorithm as well.
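To make the ordering point concrete, here is a minimal Python sketch of the hash from the question. The weights are copied from the question; MODULUS is an assumption, since the question leaves kSomethingBig unspecified. It also shows one consequence of the hash being linear: any two vectors whose difference has a zero weighted sum will collide.

```python
# Weights copied from the question. MODULUS is an assumed stand-in for
# kSomethingBig, which the question does not specify.
WEIGHTS = [234, 739, 934, 23, 828, 194]
MODULUS = 1_000_003


def hash_vector(v):
    """Inner-product style hash: sum of w[i] * v[i], reduced mod MODULUS."""
    if len(v) > len(WEIGHTS):
        raise ValueError("vector is longer than the weight table")
    result = 0
    for w, x in zip(WEIGHTS, v):
        result = (result + w * x) % MODULUS
    return result


# Same elements, different order: the hashes differ because the weights differ.
print(hash_vector([23, 30, 37]), hash_vector([30, 23, 37]))

# Linearity gives predictable collisions: 234 * 739 == 739 * 234.
print(hash_vector([739, 0]), hash_vector([0, 234]))
```

If the vector really should be treated as a set, sorting it before hashing would be one way to get the same value for every ordering.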

Out of curiosity, why not just use an existing hashing algorithm for integers and perform some interesting math on the results?

Rob
I'm writing a paper on an algorithm and am interested in finding references on this topic, so I can't get away with saying "the STL uses this implementation so it must be good".
Tyler
A: 

While I might be totally misunderstanding you, maybe it's a good idea to treat the vector as a byte stream and apply a known hash to it, e.g. SHA1 or MD5.

Just to clarify, those hashes are known to have good hash properties, and I believe there's no reason to reinvent the wheel and implement a new hash. Another possibility is to use a known CRC algorithm.
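A rough sketch of the byte-stream idea in Python, assuming every element fits in a signed 32-bit int. The helper names (pack_ints, crc32_hash, sha1_hash) are made up for illustration; struct, zlib and hashlib are from the standard library.

```python
import hashlib
import struct
import zlib


def pack_ints(v):
    """Serialize a sequence of ints as little-endian signed 32-bit values."""
    return struct.pack(f"<{len(v)}i", *v)


def crc32_hash(v):
    """CRC-32 of the packed vector, as an unsigned 32-bit int."""
    return zlib.crc32(pack_ints(v))


def sha1_hash(v, bits=32):
    """Leading `bits` bits of the SHA-1 digest of the packed vector."""
    digest = hashlib.sha1(pack_ints(v)).digest()
    return int.from_bytes(digest[: bits // 8], "big")


v = [23, 30, 37]
print(crc32_hash(v), sha1_hash(v))
```

Truncating the SHA-1 digest is just a way to get a table-sized integer out of it; for a plain hash table the CRC is likely the cheaper of the two.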

Drakosha
Thanks but SHA1 and MD5 are designed for security, and not designed with an optimal aim of avoiding collisions. They also work very differently from an inner product.
Tyler
A: 

Python used to hash tuples in this manner (source):

class tuple:
    def __hash__(self):
        value = 0x345678
        for item in self:
            # c_mul is C-style integer multiplication that wraps on overflow.
            value = c_mul(1000003, value) ^ hash(item)
        value = value ^ len(self)
        if value == -1:
            value = -2
        return value

In your case, item would always be an integer, which uses this algorithm:

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value = -2
        return value

This has nothing to do with an inner product, though, so maybe it's not much help.
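If you want to experiment with that scheme directly, here is a standalone sketch. The c_mul below assumes 32-bit signed wraparound, and old_tuple_hash is a made-up name; this is not how current Python versions hash tuples.

```python
def c_mul(a, b):
    """Multiply like a 32-bit signed C int, wrapping on overflow (assumed width)."""
    product = (a * b) & 0xFFFFFFFF
    return product - 0x100000000 if product >= 0x80000000 else product


def old_tuple_hash(items):
    """Multiply-then-XOR hash over a sequence, following the pseudocode above."""
    value = 0x345678
    for item in items:
        value = c_mul(1000003, value) ^ hash(item)
    value ^= len(items)
    return -2 if value == -1 else value


print(old_tuple_hash([23, 30, 37]))
print(old_tuple_hash([30, 23, 37]))
```

Unlike the inner product, each step multiplies the running value rather than the element, so the position of an element affects the result through repeated multiplication instead of through a per-position weight.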

Claudiu
+1  A: 

I did some (unpublished, practical) experiments with testing a variety of string hash algorithms. (It turns out that Java's default hash function for Strings sucks.)

The easy experiment is to hash the English dictionary and compare how many collisions you have on algorithm A vs algorithm B.

You can construct a similar experiment: randomly generate $BIG_NUMBER of possible vectors of length 7 or less. Hash them on algorithm A, hash them on algorithm B, then compare the number and severity of collisions.
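A sketch of what that experiment could look like in Python. The vector count, value range, table size and the two candidate hashes are all assumptions picked for illustration; the maximum length is 6 here to match the six weights in the question.

```python
import random
from collections import Counter

WEIGHTS = [234, 739, 934, 23, 828, 194]  # weights from the question


def inner_product_hash(v):
    """Candidate A: plain weighted sum, reduced later by the table size."""
    return sum(w * x for w, x in zip(WEIGHTS, v))


def random_vectors(count, max_len=6, lo=-1000, hi=1000, seed=0):
    """Generate `count` distinct random int vectors of length 1..max_len."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < count:
        n = rng.randint(1, max_len)
        seen.add(tuple(rng.randint(lo, hi) for _ in range(n)))
    return sorted(seen)


def collision_stats(hash_fn, vectors, table_size):
    """Return (number of colliding keys, size of the fullest bucket)."""
    buckets = Counter(hash_fn(v) % table_size for v in vectors)
    colliding = sum(n - 1 for n in buckets.values() if n > 1)
    worst = max(buckets.values(), default=0)
    return colliding, worst


vectors = random_vectors(10_000)
# Candidate B is Python's built-in tuple hash, used only as a baseline.
for name, fn in [("inner product", inner_product_hash), ("builtin hash", hash)]:
    print(name, collision_stats(fn, vectors, table_size=16_384))
```

The first number counts keys that landed in an already occupied bucket, and the second is a crude measure of severity. Uniformly random vectors may be kinder than real data, so it is worth repeating this with vectors that are close to each other if that is what the table will see.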

After you're able to do that, you can use simulated annealing or similar techniques to find "magic numbers" which perform well for you. In my work, for given vocabularies of interest and a tightly limited hash size, we were able to make a generic algorithm work well for several human languages by varying the "magic numbers".
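Here is a bare-bones sketch of how the annealing step might be set up for the weighted-sum hash in the question. Everything about the setup is an assumption for illustration: the linear cooling schedule, the step count, the single-weight mutation and the sample data. It is not the procedure used in the work described above.

```python
import math
import random
from collections import Counter


def collisions(weights, vectors, table_size):
    """Count keys that land in an already occupied bucket."""
    buckets = Counter(
        sum(w * x for w, x in zip(weights, v)) % table_size for v in vectors
    )
    return sum(n - 1 for n in buckets.values() if n > 1)


def anneal_weights(vectors, table_size, n_weights=6, steps=300, seed=0):
    """Search for weights with few collisions using simulated annealing."""
    rng = random.Random(seed)
    current = [rng.randrange(1, table_size) for _ in range(n_weights)]
    current_cost = collisions(current, vectors, table_size)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling from 1 towards 0 (an assumed schedule).
        temperature = max(1e-9, 1.0 - step / steps)
        candidate = list(current)
        candidate[rng.randrange(n_weights)] = rng.randrange(1, table_size)
        cost = collisions(candidate, vectors, table_size)
        delta = cost - current_cost
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = list(candidate), cost
    return best, best_cost


TABLE_SIZE = 256
data_rng = random.Random(1)
vectors = sorted({tuple(data_rng.randint(-50, 50) for _ in range(6)) for _ in range(200)})
best, best_cost = anneal_weights(vectors, TABLE_SIZE)
print(best, best_cost)
```

Weights tuned this way are fitted to the sample, so it makes sense to check them against a separate set of vectors before trusting them.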

Patrick McKenzie
Good idea, Patrick. This sounds like a very practical and effective way to find an actual algorithm. I'm still curious about any previously existing published work on this problem.
Tyler