views:

93

answers:

6

I'm trying to make a hash function so I can tell if too lists with same sizes contain the same elements.

For exemple this is what I want:
f((1 2 3))=f((1 3 2))=f((2 1 3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).

Any ideea how can I approch this problem ? I've tried doing the sum of squares of all elements but it turned out that there are collisions,for exemple f((2 2 5))=33=f((1 4 4)) which is wrong as the lists are not the same.

I'm looking for a simple approach if there is any.

+1  A: 

Sort the list and then:

list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end
glebm
can you explain this a bit,why 39 and who is MAX_HASH_VALUE ?
John Retallack
if you want to limit the size of the hash value (e.g. you want to have a hash table), then ` % MAX_HASH_VALUE ` would be an easy way to keep the hash values in the range of `(0..MAX_HASH_VALUE)`. 39 is the smallest prime that works well with these kind of hash functions (for some way more complicated mathematical reasons). note, that this function is not collision free, but you only have to compare element by element when the hashes are the same, which is very rare (try to come up with lists that give the same hash -- it's not easy)
glebm
You'd have to sort the elements of the list first for this to work.
Keith Randall
Well, of course. Fixed
glebm
@glebm I think you meant 37? 39 isn't prime.
AakashM
I screwed up :)
glebm
A: 

If all the elements are numbers and they have a maximum, this is not too complicated, you sort those elements and then you put them together one after the other in the base of your maximum+1.

Hard to describe in words... For example, if your maximum is 9 (that makes it easy to understand), you'd have :

f(2 3 9 8) = f(3 8 9 2) = 2389

If you maximum was 99, you'd have :

f(16 2 76 8) = (0)2081676

In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :

f(2 2 5) = 2*6^2 + 2*6 + 5 = 89 f(1 4 4) = 1*6^2 + 4*6 + 4 = 64

Matthieu
A: 

So you are looking something provides these properties,

1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1

2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.

Knuth's Multiplicative Method

In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.

hash(i)=i*2654435761 mod 2^32

Since 2654435761 and 2^32 has no common factors in common, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.

Robert Jenkins' 96 bit Mix Function

Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.

All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.

The following source is the mixing part of the hash function.

int mix(int a, int b, int c)
{
  a=a-b;  a=a-c;  a=a^(c >>> 13);
  b=b-c;  b=b-a;  b=b^(a << 8); 
  c=c-a;  c=c-b;  c=c^(b >>> 13);
  a=a-b;  a=a-c;  a=a^(c >>> 12);
  b=b-c;  b=b-a;  b=b^(a << 16);
  c=c-a;  c=c-b;  c=c^(b >>> 5);
  a=a-b;  a=a-c;  a=a^(c >>> 3);
  b=b-c;  b=b-a;  b=b^(a << 10);
  c=c-a;  c=c-b;  c=c^(b >>> 15);
  return c;
}

You can read details from here

berkay
These are ways of distributing the bits of values around so that close together values are separate - so you can use these to spread out a hash function which does not have collisions; it's not very clear how to apply these to problem of creating a hash function for the problem in hand to start with.
Pete Kirkham
+1  A: 

You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.

Possibly you have N, k such that you could still make this work. Good luck.

If you allow occasional collisions, you have lots of options. From simple things like your suggestion (add squares of elements) and computing xor the elements, to complicated things like sort them, print them to a string, and compute MD5 on them. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).

Keith Randall
A: 

Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:

template <class T>
void hash_combine(size_t& seed, T const& v)
{
  seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).

However the requirement than the order of the list does not influence the end-result would mean that you first have to sort it which is an O(N log N) operation, so it may not fit.

Also, since it's impossible without more stringent boundaries to provide a collision free hash function, you'll still have to actually compare the sorted lists if ever the hash are equals...

Matthieu M.
A: 

I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.

[...] but it turned out that there are collisions

These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general usual case, it cannot, because there are more potential inputs than potential outputs.

Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.

As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as if they were the prime powers in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is a hash value.

For example:

2 1 3

sorted becomes

1 2 3

Treating this as prime powers gives

2^1.3^2.5^3

which construct

2.9.125 = 2250

giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.

AakashM