Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.

h(abcdef) = h(bcdefa) = h(cdefab), etc.

Uses for this include scalable algorithms which can check n strings against each other to see whether some are rotations of others.

I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?

It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...

+8  A: 

I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".

So:

"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
Jon Skeet
+2  A: 

You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.

Chris Lercher
+1  A: 

I am sure you could find a function that generates the same hash regardless of character position in the input; however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions occur for all hash algorithms, so I mean: how do you minimize this risk?)

You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
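One cheap exact check to run on hash-equal candidates is the standard substring trick (my illustration, not part of the answer): t is a rotation of s exactly when the lengths match and t occurs inside s + s.

```java
public class RotationCheck {
    // Exact verification for candidates whose hashes collide:
    // t is a rotation of s iff |s| == |t| and t is a substring of s + s.
    static boolean isRotation(String s, String t) {
        return s.length() == t.length() && (s + s).contains(t);
    }

    public static void main(String[] args) {
        System.out.println(isRotation("abcdef", "defabc")); // true
        System.out.println(isRotation("abcdef", "defcab")); // false
    }
}
```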

PatrikAkerstrand
+4  A: 

Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.

I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.

static int Hash(string s)
{
    int H = 0;

    if (s.Length > 0)
    {
        //any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;

        //an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
        //(0x100 entries so the full character range 0x00..0xFF can be indexed)
        int[] c = new int[0x100];

        for (int i = 1; i < c.Length; i++)
        {
            c[i] = i + 1;
        }

        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();

        //for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);

        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}

The output is now:

7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71

First Version (which isn't complete): Use XOR which is commutative (order doesn't matter) and another little trick involving coprimes to combine ordered hashes of pairs of letters in the string. Here is an example in C#:

static int Hash(char[] s)
{
    //any arbitrary coprime numbers
    const int a = 7, b = 13;

    int H = 0;

    if (s.Length > 0)
    {
        //for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());

        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}

The output is:

4587590
4587590
4587590
7077996
Michael Petito
Also, as for checking n strings against each other, you might consider feeding K versions of this hash algorithm (perhaps using different coprimes) into a bloom filter of sufficient size for n.
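A rough sketch of that Bloom filter idea (the filter size, the coprime pairs, and the method names are arbitrary choices of mine): derive K variants of the pairwise hash from different coprime multipliers and set/test K bits per string.

```java
import java.util.BitSet;

public class RotationBloom {
    static final int SIZE = 1 << 20;
    // K hash variants from different coprime multiplier pairs.
    static final int[][] COPRIMES = { {7, 13}, {11, 17}, {19, 23} };

    // XOR of ordered hashes of adjacent character pairs (wrapping around),
    // so every rotation of a string produces the same value.
    static int variantHash(String s, int a, int b) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            char prev = s.charAt((i + s.length() - 1) % s.length());
            h ^= a * prev + b * s.charAt(i);
        }
        return h;
    }

    static void add(BitSet filter, String s) {
        for (int[] p : COPRIMES)
            filter.set(Math.floorMod(variantHash(s, p[0], p[1]), SIZE));
    }

    // False positives possible; false negatives are not.
    static boolean mightContainRotation(BitSet filter, String s) {
        for (int[] p : COPRIMES)
            if (!filter.get(Math.floorMod(variantHash(s, p[0], p[1]), SIZE)))
                return false;
        return true;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet(SIZE);
        add(filter, "abcdef");
        System.out.println(mightContainRotation(filter, "cdefab")); // true
    }
}
```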
Michael Petito
It's fairly easy to come up with collisions here. For example, "a0a0" and "1010" (or indeed anything similar) will come up with a hash of 0, and "blocks" with a common boundary confuse it: "0abc0def0ghi" and "0def0abc0ghi" have the same hash. Nice idea though.
Jon Skeet
@Jon Skeet Yes, you are absolutely right. I wonder if there is a simple modification that could be made to handle such input...
Michael Petito
A: 

I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.

If you can represent the string as a matrix of associations so abcdef would look like

  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x

But any rotation of the string produces the same matrix, since it only records which letter follows which. It would be trivial to compare those matrices.
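A sketch of that matrix comparison (my illustration, restricted to lowercase letters for brevity); note it also exhibits the "ab" vs "abab" false positive mentioned in the comments:

```java
import java.util.Arrays;

public class SuccessorMatrix {
    // Build a 26x26 boolean matrix marking which letter follows which,
    // wrapping from the last character back to the first.
    static boolean[][] matrix(String s) {
        boolean[][] m = new boolean[26][26];
        for (int i = 0; i < s.length(); i++) {
            int from = s.charAt(i) - 'a';
            int to = s.charAt((i + 1) % s.length()) - 'a';
            m[from][to] = true;
        }
        return m;
    }

    static boolean sameAssociations(String s, String t) {
        return Arrays.deepEquals(matrix(s), matrix(t));
    }

    public static void main(String[] args) {
        System.out.println(sameAssociations("abcdef", "defabc")); // true
        System.out.println(sameAssociations("ab", "abab"));       // true (false positive)
    }
}
```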


Another quicker trick would be to rotate the string so that the "first" letter is first. Then if you have the same starting point, the same strings will be identical.

Here is some Ruby code:

def normalize_string(string)
  myarray = string.split(//)            # split into an array
  index   = myarray.index(myarray.min)  # find the index of the minimum element
  index.times do
    myarray.push(myarray.shift)         # move stuff from the front to the back
  end
  return myarray.join
end

p normalize_string('abcdef').eql?(normalize_string('defabc'))  # should return true
Fotios
@Fotios: Would the first solution really work if the elements aren't unique? "ab" and "abab" would produce the same matrix, if I understand it correctly? It may still be good enough for a hash function!
Chris Lercher
Yea, it probably would not work with multiples like that, but there might be ways to work around that.
Fotios