views:

354

answers:

4

I'm looking to use a rolling hash function so I can take hashes of n-grams of a very large string.

For example:

"stackoverflow", broken up into 5 grams would be:

"stack", "tacko", "ackov", "ckove", "kover", "overf", "verfl", "erflo", "rflow"

This is ideal for a rolling hash function because after I calculate the first n-gram hash, the following ones are relatively cheap to calculate because I simply have to drop the first letter of the first hash and add the new last letter of the second hash.

I know that in general this hash function is generated as:

H = c1ak − 1 + c2ak − 2 + c3ak − 3 + ... + cka0 where a is a constant and c1,...,ck are the input characters.

If you follow this link on the Rabin-Karp string search algorithm , it states that "a" is usually some large prime.

I want my hashes to be stored in 32 bit integers, so how large of a prime should "a" be, such that I don't overflow my integer?

Does there exist an existing implementation of this hash function somewhere that I could already use?


Here is an implementation I created:

public class hash2
{

    public int prime = 101;

    public int hash(String text)
    {
        int hash = 0;

        for(int i = 0; i < text.length(); i++)
        {
            char c = text.charAt(i);
            hash += c * (int) (Math.pow(prime, text.length() - 1 - i));
        }

        return hash;
    }

    public int rollHash(int previousHash, String previousText, String currentText)
    {

        char firstChar = previousText.charAt(0);
        char lastChar = currentText.charAt(currentText.length() - 1);

        int firstCharHash = firstChar * (int) (Math.pow(prime, previousText.length() - 1));
        int hash = (previousHash - firstCharHash) * prime + lastChar;

        return hash;
    }

    public static void main(String[] args)
    {
        hash2 hashify = new hash2();

        int firstHash = hashify.hash("mydog");
        System.out.println(firstHash);
        System.out.println(hashify.hash("ydogr"));
        System.out.println(hashify.rollHash(firstHash, "mydog", "ydogr"));
    }

}

I'm using 101 as my prime. Does it matter if my hashes will overflow? I think this is desirable but I'm not sure.

Does this seem like the right way to go about this?

A: 

As i understand it's a function minimization for:

2^31 - sum (maxchar) * A^kx

where maxchar = 62 (for A-Za-z0-9). I've just calculated it by Excel (OO Calc, exactly) :) and a max A it found is 76, or 73, for a prime number.

splix
+1  A: 

i remember a slightly different implementation which seems to be from one of sedgewick's algorithms books (it also contains example code - try to look it up). here's a summary adjusted to 32 bit integers:

you use modulo arithmetic to prevent your integer from overflowing after each operation.

initially set:

  • c = text ("stackoverflow")
  • M = length of the "n-grams"
  • d = size of your alphabet (256)
  • q = a large prime so that (d+1)*q doesn't overflow (8355967 might be a good choice)
  • dM = dM-1 mod q

first calculate the hash value of the first n-gram:

h = 0
for i from 1 to M:
  h = (h*d + c[i]) mod q

and for every following n-gram:

for i from 1 to lenght(c)-M:
  // first subtract the oldest character
  h = (h + d*q - c[i]*dM) mod q

  // then add the next character
  h = (h*d + c[i+M]) mod q

the reason why you have to add d*q before subtracting the oldest character is because you might run into negative values due to small values caused by the previous modulo operation.

errors included but i think you should get the idea. try to find one of sedgewick's algorithms books for details, less errors and a better description. :)

stmax
A: 

Not sure what your aim is here, but if you are trying to improve performance, using math.pow will cost you far more than you save by calculating a rolling hash value.

I suggest you start by keeping to simple and efficient and you are very likely find it is fast enough.

Peter Lawrey