views: 1205
answers: 2
According to the Wikipedia entry on the Rabin-Karp string matching algorithm, it can be used to look for several different patterns in a string at the same time while still maintaining linear complexity. It is clear that this is easily done when all the patterns are of the same length, but I still don't get how we can preserve O(n) complexity when simultaneously searching for patterns of differing lengths. Can someone please shed some light on this?

+1  A: 

I'm not sure if this is the correct answer, but anyway:

While computing the rolling hash value, we can look up the current hash in the set of pattern hashes. The hash function is usually implemented as a loop, and inside that loop we can insert our quick lookup.

Of course, we must set m to the maximum string length in the set of patterns.

Update: From Wikipedia,

[...]
for i from 1 to n-m+1
    if hs ∈ hsubs
        if s[i..i+m-1] = a substring with hash hs
            return i
    hs := hash(s[i+1..i+m]) // <---- calculating the current hash
[...]

We calculate the current hash in m steps. On each step there is a temporary hash value that we can look up (O(1) complexity) in the set of hashes. All hashes have the same size, i.e. 32 bits.
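To make the equal-length case concrete, here is a hedged sketch (function name, base, and modulus are my own choices, not from the article): one rolling hash scans the text, and each window hash is checked against a set of pattern hashes in O(1), so the scan itself is O(n) plus verification on hits.

```python
# Sketch: Rabin-Karp search for several patterns of the SAME length m,
# using a simple polynomial rolling hash modulo a prime (illustrative values).

def rabin_karp_multi(text, patterns, base=256, mod=1_000_003):
    m = len(patterns[0])                      # all patterns share length m
    assert all(len(p) == m for p in patterns)
    if len(text) < m:
        return []

    def h(s):                                 # O(m) polynomial hash
        v = 0
        for c in s:
            v = (v * base + ord(c)) % mod
        return v

    pattern_hashes = {}                       # hash -> patterns with that hash
    for p in patterns:
        pattern_hashes.setdefault(h(p), []).append(p)

    hits = []
    hs = h(text[:m])                          # hash of the first window
    pow_m = pow(base, m - 1, mod)             # base^(m-1), to drop the left char
    for i in range(len(text) - m + 1):
        if hs in pattern_hashes:              # O(1) dict lookup
            window = text[i:i + m]
            if window in pattern_hashes[hs]:  # verify to rule out hash collisions
                hits.append((i, window))
        if i + m < len(text):                 # roll the hash one position right
            hs = ((hs - ord(text[i]) * pow_m) * base + ord(text[i + m])) % mod
    return hits
```

For example, `rabin_karp_multi("abcabd", ["abc", "abd"])` finds both patterns in a single pass over the text.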

Update 2: amortized (average-case) O(n) time complexity?

Above I said that m must be the maximum string length. It turns out that we can exploit the opposite.
With hashing for shifting substring search and a fixed window size m, we can achieve O(n) complexity.

If we have variable-length strings, we can set m to the minimum string length. Additionally, in the set of hashes we don't associate a hash with the whole string but with its first m characters.
Now, while scanning the text, we check whether the current hash is in the hash set, and if so we examine the associated strings for a full match.

This technique increases the number of false alarms, but on average it has O(n) time complexity.
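A hedged sketch of this variant (again with illustrative names and constants of my own): m is the minimum pattern length, each pattern is filed under the hash of its first m characters, and every hash hit triggers a full comparison of the candidate patterns. This gives average-case O(n), though, as discussed below, adversarial input can push it toward O(n*m).

```python
# Sketch: Rabin-Karp for variable-length patterns, hashing only the
# m-character prefix of each pattern (m = minimum pattern length).

def rabin_karp_varlen(text, patterns, base=256, mod=1_000_003):
    m = min(len(p) for p in patterns)

    def h(s):                                 # O(m) polynomial hash
        v = 0
        for c in s:
            v = (v * base + ord(c)) % mod
        return v

    buckets = {}                              # prefix hash -> candidate patterns
    for p in patterns:
        buckets.setdefault(h(p[:m]), []).append(p)

    hits = []
    if len(text) < m:
        return hits
    hs = h(text[:m])
    pow_m = pow(base, m - 1, mod)
    for i in range(len(text) - m + 1):
        for p in buckets.get(hs, ()):
            if text[i:i + len(p)] == p:       # verify the WHOLE pattern, not just the prefix
                hits.append((i, p))
        if i + m < len(text):                 # O(1) roll to the next window
            hs = ((hs - ord(text[i]) * pow_m) * base + ord(text[i + m])) % mod
    return hits
```

Every prefix-hash hit costs up to the full pattern length to verify, which is exactly where the false alarms mentioned above come from.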

Nick D
Could you please elaborate? As far as I can understand, you are suggesting keeping multiple hashes (one for each pattern length) and using those to query a hash table/BST. But doesn't computing more than a constant number of hashes each iteration make the complexity more than linear?
MAK
@MAK, see my update.
Nick D
Thanks for explaining. But that is exactly the source of my confusion. If we compute the current hash value in m steps, our overall complexity is no longer linear. It becomes O(n*m) (n is the length of the string, m is the length of the longest pattern).
MAK
@MAK, O(n) is feasible if all strings are of size `m` and we calculate the hash as mentioned in the section *Use of hashing for shifting substring search*. But for variable-length strings, IMO, O(nm) is the best we can do with the Rabin-Karp algorithm. Maybe that Wiki entry isn't clear.
Nick D
I'm beginning to think that the Wikipedia entry is simply wrong on this. It claims the complexity should be O(n+k). Thanks again.
MAK
@MAK, see update2. If I come up with a better solution, I'll post a new update.
Nick D
@Nick D., Thank you very much. That solution is probably the correct O(n) algorithm (or very close). But suppose the smallest pattern length is 1 (i.e. a single-character pattern). That would still cause a very high rate of false alarms if the input alphabet is small (e.g. 'A'-'Z'), probably enough to cause near O(nm) performance in practice. But thanks, this is a brilliant solution nonetheless.
MAK
@Nick D., I had a talk about this with a friend, and he came up with an input case that causes this approach to fail. Try your algorithm against this: pattern 1: `blah`, pattern 2: `aaaabcd`, text to search: `aaaaaaaaaaa.........aa` (all 'a's). That would still cause O(nm) performance.
MAK
I'm aware of such cases. That kind of data shows off the worst side of the algorithm. There is probably a way to minimize the false alarms, but I haven't figured it out yet.
Nick D
Note that even quicksort in the worst case makes Θ(n²) comparisons ;) (excuses, excuses, excuses)
Nick D
A: 

It's because the hash values of the substrings are related mathematically. Computing the hash H(S,j) (the hash of the characters starting from the jth position of string S) takes O(m) time on a string of length m. But once you have that, computing H(S, j+1) can be done in constant time, because H(S, j+1) can be expressed as a function of H(S, j).

So it's O(m) once for the first window, plus O(1) for each of the remaining shifts: O(m) + (n-m)·O(1) = O(n+m), i.e. linear time.
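A hedged illustration of that recurrence, using a simple polynomial hash mod a prime (my own choice of hash, not necessarily the one the article uses): each H(S, j+1) is derived from H(S, j) in constant time, and only the first window is computed from scratch.

```python
# Demonstrate that the rolling update matches the O(m) recomputation.

def full_hash(s, base=256, mod=1_000_003):
    # O(m) direct computation: H = s[0]*base^(m-1) + ... + s[m-1]  (mod p)
    v = 0
    for c in s:
        v = (v * base + ord(c)) % mod
    return v

def roll(prev_hash, left_char, right_char, pow_m, base=256, mod=1_000_003):
    # H(S, j+1) = (H(S, j) - s[j]*base^(m-1)) * base + s[j+m]   (mod p)
    return ((prev_hash - ord(left_char) * pow_m) * base + ord(right_char)) % mod

text, m = "abcdef", 3
base, mod = 256, 1_000_003
pow_m = pow(base, m - 1, mod)                 # base^(m-1) mod p, precomputed once
h = full_hash(text[:m])                       # only the first window costs O(m)
for j in range(len(text) - m):
    h = roll(h, text[j], text[j + m], pow_m)  # O(1) per shift
    assert h == full_hash(text[j + 1:j + 1 + m])
```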

Here's a link where this is described in more detail (see e.g. the section "What makes Rabin-Karp fast?")

ire_and_curses
I get why Rabin-Karp is fast. I've used it before to find single patterns in a string. I'm trying to figure out how it can be used to find multiple patterns in a string simultaneously in O(n) time (as opposed to O(n*k) when searching for k patterns one by one).
MAK
@MAK: Sorry, I misunderstood your question. Isn't the answer to that at the bottom of the wikipedia article? "In contrast, the variant Rabin-Karp above can find all k patterns in O(n+k) time in expectation, because a hash table checks whether a substring hash equals any of the pattern hashes in O(1) time." Creating the hash is O(k). Looking for a match in a hash table is an O(1) operation. If any match, you win.
ire_and_curses