I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.

Is there a faster method than doing

foreach (pattern in patterns) {
    for (i = 0; i + patternLength <= listLength; i++) {
        for (t = 0; t < patternLength; t++) {
            if (list[i + t] != pattern[t]) {
                break;
            }
        }
        if (t == patternLength) {
            return i;  // pattern found!
        }
    }
}
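
For reference, a runnable C++ version of this scan (byte-per-bit storage and the names are illustrative). Note that putting the offset loop outermost is what actually returns the first occurrence of either pattern; with the pattern loop outermost, as above, a later occurrence of the first pattern can shadow an earlier occurrence of the second:

#include <cstdint>
#include <iostream>
#include <vector>

// Naive scan: for every start offset, try every pattern.
// Returns the first index at which any pattern occurs, or -1.
int find_first_naive(const std::vector<uint8_t>& list,
                     const std::vector<std::vector<uint8_t>>& patterns) {
    for (size_t i = 0; i < list.size(); ++i) {
        for (const auto& pattern : patterns) {
            if (i + pattern.size() > list.size()) continue;
            size_t t = 0;
            while (t < pattern.size() && list[i + t] == pattern[t]) ++t;
            if (t == pattern.size()) return (int)i;  // pattern found!
        }
    }
    return -1;  // no pattern occurs
}

int main() {
    std::vector<uint8_t> list = {1, 1, 0, 0, 0, 1, 1, 0, 1};
    std::vector<std::vector<uint8_t>> patterns = {{0, 0, 0, 1}, {1, 0, 1}};
    std::cout << find_first_naive(list, patterns) << "\n";  // prints 2
}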

Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8-megapixel images, so every little bit helps.

Edit: If it's not clear, I'm working with a bit array, so there are only two possibilities: ONE and ZERO. And it's in C++.

Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says

The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly).

That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?

+3  A: 

Yes.

The Boyer–Moore string search algorithm

See also: Knuth–Morris–Pratt algorithm

Mitch Wheat
Yes, both BM and KMP will be faster than the pseudocode shown, but will still only search for one pattern at a time. So the outermost `foreach (patterns) {}` will still be around.
earl
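
For concreteness, a minimal sketch of the Boyer-Moore-Horspool variant (bad-character rule only), assuming the bits are stored one per char in a std::string; names are illustrative:

#include <iostream>
#include <string>

// Boyer-Moore-Horspool sketch: single pattern, bad-character rule only.
int horspool(const std::string& text, const std::string& pattern) {
    size_t n = text.size(), m = pattern.size();
    if (m == 0 || n < m) return -1;

    // shift[c]: how far the window may slide when its last char is c.
    size_t shift[256];
    for (size_t c = 0; c < 256; ++c) shift[c] = m;
    for (size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pattern[i]] = m - 1 - i;

    for (size_t i = 0; i + m <= n; i += shift[(unsigned char)text[i + m - 1]]) {
        size_t j = m;
        while (j > 0 && text[i + j - 1] == pattern[j - 1]) --j;
        if (j == 0) return (int)i;  // full right-to-left match
    }
    return -1;
}

int main() {
    std::cout << horspool("110100100010111", "10100100010") << "\n";  // prints 1
}

Note that on a two-symbol alphabet the shift table holds only two distinct values, so the skips are modest; this limitation comes up again in the last answer below.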
+1  A: 

You could build a suffix array and search it; the query time is crazy good: O(length(pattern)). BUT you have to build that array first, so it's only worth it when the text is static and the patterns are dynamic.

n00ki3
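
To make the trade-off concrete, a naive suffix-array sketch (hypothetical names; the O(n^2 log n) construction is for brevity, while real builders run in O(n log n) or better). The binary search returns some occurrence, not necessarily the leftmost, so reporting the leftmost would need a scan of the matching range:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Naive suffix-array construction: sort all suffix start positions.
std::vector<int> build_suffix_array(const std::string& text) {
    std::vector<int> sa(text.size());
    for (size_t i = 0; i < sa.size(); ++i) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Binary search for a suffix that starts with `pattern`.
// Returns the start index of one occurrence, or -1.
int find(const std::string& text, const std::vector<int>& sa,
         const std::string& pattern) {
    int lo = 0, hi = (int)sa.size();
    while (lo < hi) {  // lower bound over suffix prefixes
        int mid = (lo + hi) / 2;
        if (text.compare(sa[mid], pattern.size(), pattern) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < (int)sa.size() &&
        text.compare(sa[lo], pattern.size(), pattern) == 0)
        return sa[lo];
    return -1;
}

int main() {
    std::string text = "0001101101010100100";
    std::vector<int> sa = build_suffix_array(text);
    std::cout << find(text, sa, "0110") << "\n";  // prints 5 (index 2 matches too)
}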
A: 

If your strings need to be flexible, I would also recommend a modified Boyer-Moore string search algorithm, as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The Boyer-Moore model is incredibly efficient for searching a large amount of data for any of several strings to match against.

Jacob

TheJacobTaylor
What do you mean by "flexible"?
erjiang
Sorry about that, I was tired. I was essentially thinking of the Turbo Boyer-Moore algorithm (http://www-igm.univ-mlv.fr/~lecroq/string/node15.html) but could not remember the name of it. The nice thing about BM is that it pre-processes all of the strings that are being searched for, and runs through the main string one time to search for any match for any of the strings. That is why it does not pre-process the main string. It makes only one pass through. Granted, base BM can have 3n comparisons (where n is the size of the string being searched). Turbo BM is limited to 2n.
TheJacobTaylor
+6  A: 

The key for Googling is "multi-pattern" string matching.

Back in 1975, Aho and Corasick published a linear-time algorithm that was used in the original version of fgrep. The algorithm was subsequently refined by many researchers. For example, Commentz-Walter (1979) combined Aho-Corasick with Boyer-Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.

An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.

I suggest starting with the Aho-Corasick and Rabin-Karp Wikipedia pages and then deciding whether either would make sense in your case. If so, maybe there is already an implementation available for your language/runtime.

earl
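
To illustrate the Aho-Corasick idea, here is a minimal sketch for a binary alphabet (C++17, hypothetical names): build a trie of the patterns, add failure links, then scan the text once. It reports the match that ends earliest, which with mixed pattern lengths is not always the one that starts earliest:

#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct AhoCorasick {
    struct Node {
        int next[2] = {-1, -1};
        int fail = 0;
        int match = -1;  // index of a pattern ending here, if any
    };
    std::vector<Node> nodes{1};
    std::vector<size_t> pattern_len;

    void add(const std::string& pattern) {  // insert one pattern into the trie
        int v = 0;
        for (char c : pattern) {
            int b = c - '0';
            if (nodes[v].next[b] < 0) {
                nodes[v].next[b] = (int)nodes.size();
                nodes.emplace_back();
            }
            v = nodes[v].next[b];
        }
        nodes[v].match = (int)pattern_len.size();
        pattern_len.push_back(pattern.size());
    }

    void build() {  // BFS fills failure links and missing transitions
        std::queue<int> q;
        for (int b = 0; b < 2; ++b) {
            int u = nodes[0].next[b];
            if (u < 0) nodes[0].next[b] = 0;
            else { nodes[u].fail = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            if (nodes[v].match < 0 && nodes[nodes[v].fail].match >= 0)
                nodes[v].match = nodes[nodes[v].fail].match;  // suffix match
            for (int b = 0; b < 2; ++b) {
                int u = nodes[v].next[b];
                if (u < 0) nodes[v].next[b] = nodes[nodes[v].fail].next[b];
                else { nodes[u].fail = nodes[nodes[v].fail].next[b]; q.push(u); }
            }
        }
    }

    // One pass over the text; returns {start position, pattern index} or {-1, -1}.
    std::pair<int, int> find_first(const std::string& text) {
        int v = 0;
        for (size_t i = 0; i < text.size(); ++i) {
            v = nodes[v].next[text[i] - '0'];
            if (nodes[v].match >= 0) {
                int p = nodes[v].match;
                return {(int)(i + 1 - pattern_len[p]), p};
            }
        }
        return {-1, -1};
    }
};

int main() {
    AhoCorasick ac;
    for (const std::string& p : {"0001101101", "01010100100", "10100100010"})
        ac.add(p);
    ac.build();
    auto [pos, which] = ac.find_first("1110100100010");
    std::cout << pos << " " << which << "\n";  // prints 2 2
}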
I personally recommend Rabin-Karp, as it's relatively good at multi-pattern searches as well as being dirt simple to implement. The only remotely challenging part is the rolling hash, which you can read up on here (http://en.wikipedia.org/wiki/Rolling_hash)
Falaina
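
And a minimal multi-pattern Rabin-Karp sketch along those lines (C++17, hypothetical names; it assumes all patterns share one length, so with mixed lengths you would keep one rolling hash per length):

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One rolling hash slides across the text and is checked against the
// precomputed pattern hashes; a hash hit is verified char-by-char to
// rule out collisions.
std::pair<int, int> rabin_karp(const std::string& text,
                               const std::vector<std::string>& patterns) {
    const uint64_t base = 257, mod = 1000000007ULL;
    const size_t m = patterns[0].size();
    if (text.size() < m) return {-1, -1};

    uint64_t high = 1;  // base^(m-1), for removing the outgoing char
    for (size_t i = 1; i < m; ++i) high = (high * base) % mod;

    auto hash_of = [&](const std::string& s, size_t from) {
        uint64_t h = 0;
        for (size_t i = 0; i < m; ++i)
            h = (h * base + (unsigned char)s[from + i]) % mod;
        return h;
    };

    // Pattern hash -> pattern index (colliding pattern hashes would need
    // a multimap; this sketch ignores that case).
    std::unordered_map<uint64_t, int> pattern_hashes;
    for (int j = 0; j < (int)patterns.size(); ++j)
        pattern_hashes[hash_of(patterns[j], 0)] = j;

    uint64_t h = hash_of(text, 0);
    for (size_t i = 0; ; ++i) {
        auto it = pattern_hashes.find(h);
        if (it != pattern_hashes.end() &&
            text.compare(i, m, patterns[it->second]) == 0)
            return {(int)i, it->second};  // verified match
        if (i + m >= text.size()) break;
        h = (h + mod - (high * (unsigned char)text[i]) % mod) % mod;  // drop text[i]
        h = (h * base + (unsigned char)text[i + m]) % mod;            // add text[i+m]
    }
    return {-1, -1};
}

int main() {
    std::vector<std::string> patterns = {"0001101101", "0100100010"};
    auto [pos, which] = rabin_karp("110100100010111", patterns);
    std::cout << pos << " " << which << "\n";  // prints 2 1
}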
A: 

If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.

The code looks like this:

# psyco JIT-compiles Python 2 code at run-time; skip it if it's unavailable.
try:
    import psyco
    psyco.full()
except ImportError:
    pass

def encode(s):
    """Run-length encode a 0/1 string: a run of n '0's -> -n, of n '1's -> n."""
    def calc_count(count, c):
        return count * (-1 if c == '0' else 1)
    result = []
    c = s[0]
    count = 1
    for i in range(1, len(s)):
        d = s[i]
        if d == c:
            count += 1
        else:
            result.append(calc_count(count, c))
            count = 1
            c = d
    result.append(calc_count(count, c))
    return result

def search(encoded_source, targets):
    """Return (position, encoded target, target index) of the earliest match."""
    def match(encoded_source, t, max_search_len, len_source):
        # Interior runs must match exactly; the source's boundary runs need
        # only be at least as long as the target's boundary runs.
        x = len(t)-1
        # Get the indexes of the longest segments and search them first
        most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
        # Align the signs of the source and target
        index = (0 if encoded_source[0] * t[0] > 0 else 1)
        unencoded_pos = sum(abs(c) for c in encoded_source[:index])
        start_t, end_t = abs(t[0]), abs(t[x])
        for i in range(index, len(encoded_source)-x, 2):
            if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                if start_t <= encoded_start and end_t <= encoded_end:
                    return unencoded_pos + (abs(encoded_source[i]) - start_t)
            unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
            if unencoded_pos > max_search_len:
                return len_source
        return len_source
    len_source = sum(abs(c) for c in encoded_source)
    i, found, target_index = len_source, None, -1
    for j, t in enumerate(targets):
        x = match(encoded_source, t, i, len_source)
        print "Match at: ", x
        if x < i:
            i, found, target_index = x, t, j
    return (i, found, target_index)


if __name__ == "__main__":
    import datetime
    def make_source_text(length):
        from random import randint
        item_len = 8
        item_count = 2**item_len
        table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
        return "".join(table[randint(0, item_count - 1)] for _ in range(length // item_len))
    targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
    encoded_targets = [encode(t) for t in targets]
    data_len = 10*1000*1000
    s = datetime.datetime.now()
    source_text = make_source_text(data_len)
    e = datetime.datetime.now()
    print "Make source text(length %d): " % data_len,  (e - s)
    s = datetime.datetime.now()
    encoded_source = encode(source_text)
    e = datetime.datetime.now()
    print "Encode source text: ", (e - s)

    s = datetime.datetime.now()
    (i, found, target_index) = search(encoded_source, encoded_targets)
    print (i, found, target_index)
    print "Target was: ", targets[target_index]
    print "Source matched here: ", source_text[i:i+len(targets[target_index])]
    e = datetime.datetime.now()
    print "Search time: ", (e - s)

On a string of 10 million characters (a bit longer than your 8 million), it takes about seven seconds to find the earliest match of the three targets. Of course, since I am using random text, that varies a bit with each run.

psyco is a Python module that optimizes code at run-time. With it, you get good performance, and you can treat the timings as a rough upper bound on the time C/C++ would need. Here is recent performance:

Make source text(length 10000000):  0:00:02.277000
Encode source text:  0:00:00.329000
Match at:  2517905
Match at:  494990
Match at:  450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was:  1010010001010100100010
Source matched here:  1010010001010100100010
Search time:  0:00:04.325000

It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.

hughdbrown
I'm concerned that the time it takes to encode 8 million bits into runs, do a few searches, and then decode back into raw bits might exceed the time for BM or KMP searches on the raw bits
erjiang
That's certainly a possibility.
hughdbrown
+1  A: 

A solution that could be efficient:

  1. store the patterns in a trie data structure
  2. start searching the list
  3. walk the trie with the next chars of the list and stop on reaching a stored pattern (at most pattern_length steps, i.e. constant with respect to the list length); see the sketch below
  4. step one char and repeat #3

If the list isn't mutable, you can store the offsets of matching patterns to avoid repeating the calculations the next time.

Nick D
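
A minimal sketch of that trie walk (hypothetical names; bits stored as '0'/'1' chars):

#include <iostream>
#include <memory>
#include <string>

// Minimal binary trie over the patterns: two children per node (bit 0 /
// bit 1) plus a flag marking the end of a stored pattern.
struct TrieNode {
    std::unique_ptr<TrieNode> child[2];
    bool is_pattern_end = false;
};

void insert(TrieNode* root, const std::string& pattern) {
    TrieNode* node = root;
    for (char c : pattern) {
        int b = c - '0';
        if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
        node = node->child[b].get();
    }
    node->is_pattern_end = true;
}

// Slide over the bit string; at each offset, walk the trie until a stored
// pattern ends (match) or a child is missing (mismatch).
int find_first(const std::string& bits, const TrieNode* root) {
    for (size_t i = 0; i < bits.size(); ++i) {
        const TrieNode* node = root;
        for (size_t j = i; j < bits.size(); ++j) {
            node = node->child[bits[j] - '0'].get();
            if (!node) break;                         // mismatch: next offset
            if (node->is_pattern_end) return (int)i;  // a pattern matched
        }
    }
    return -1;
}

int main() {
    TrieNode root;
    for (const std::string& p : {"0001101101", "01010100100", "10100100010"})
        insert(&root, p);
    std::cout << find_first("111000110110100", &root) << "\n";  // prints 3
}

Aho-Corasick, discussed above, extends exactly this trie with failure links, so the per-offset re-walk disappears and the list is scanned once.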
A: 

If it's a bit array, I suppose doing a rolling sum would be an improvement. If the pattern is length n, sum the first n bits and see if the sum matches the pattern's sum. Then, for every next position, subtract the outgoing first bit of the window from the sum and add the incoming bit, and compare again. A matching sum is only a candidate, since different windows can share a bit count, so you still verify the actual bits on a hit; but a mismatching sum lets you skip the linear loop over the pattern.

It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first (bad-character) table doesn't help a whole lot. The second (good-suffix) table might help, but that means BMH, which drops it, is mostly worthless.

Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy), and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.

erjiang
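
For illustration, a minimal sketch of such a rolling-sum filter, assuming byte-per-bit storage (names are hypothetical; a single pattern for brevity, so for several patterns you would run one sum per pattern length):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// The window's bit count is a weak hash: when it differs from the pattern's
// count the inner loop is skipped, and when it agrees the window is
// verified bit-by-bit.
int find_with_rolling_sum(const std::vector<uint8_t>& bits,
                          const std::vector<uint8_t>& pattern) {
    const size_t n = bits.size(), m = pattern.size();
    if (m == 0 || n < m) return -1;
    const int target = std::accumulate(pattern.begin(), pattern.end(), 0);
    int sum = std::accumulate(bits.begin(), bits.begin() + m, 0);
    for (size_t i = 0; ; ++i) {
        if (sum == target &&
            std::equal(pattern.begin(), pattern.end(), bits.begin() + i))
            return (int)i;                 // verified match
        if (i + m >= n) break;
        sum += bits[i + m] - bits[i];      // add incoming bit, drop outgoing
    }
    return -1;
}

int main() {
    std::vector<uint8_t> bits = {1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1};
    std::vector<uint8_t> pattern = {0, 0, 1, 1, 0};
    std::cout << find_with_rolling_sum(bits, pattern) << "\n";  // prints 3
}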