I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.

Is there a faster method than doing

foreach (pattern in patterns) {
    for (i = 0; i + patternLength <= listLength; i++) {
        for (t = 0; t < patternLength; t++) {
            if (list[i + t] != pattern[t]) {
                break;
            }
        }
        if (t == patternLength) {
            return i;  // pattern found!
        }
    }
}
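
For reference, a runnable C++ version of this scan (byte-per-bit storage and the names are illustrative). Note that putting the offset loop outermost is what actually returns the first occurrence of either pattern; with the pattern loop outermost, as above, a later occurrence of the first pattern can shadow an earlier occurrence of the second:

#include <cstdint>
#include <iostream>
#include <vector>

// Naive scan: for every start offset, try every pattern.
// Returns the first index at which any pattern occurs, or -1.
int find_first_naive(const std::vector<uint8_t>& list,
                     const std::vector<std::vector<uint8_t>>& patterns) {
    for (size_t i = 0; i < list.size(); ++i) {
        for (const auto& pattern : patterns) {
            if (i + pattern.size() > list.size()) continue;
            size_t t = 0;
            while (t < pattern.size() && list[i + t] == pattern[t]) ++t;
            if (t == pattern.size()) return (int)i;  // pattern found!
        }
    }
    return -1;  // no pattern occurs
}

int main() {
    std::vector<uint8_t> list = {1, 1, 0, 0, 0, 1, 1, 0, 1};
    std::vector<std::vector<uint8_t>> patterns = {{0, 0, 0, 1}, {1, 0, 1}};
    std::cout << find_first_naive(list, patterns) << "\n";  // prints 2
}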

Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8-megapixel images, so every little bit helps.

Edit: If it's not clear, I'm working with a bit array, so there are only two possibilities: ONE and ZERO. And it's in C++.

Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says

The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly).

That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?

+3  A: 

Yes.

The Boyer–Moore string search algorithm

See also: Knuth–Morris–Pratt algorithm

Mitch Wheat
Yes, both BM and KMP will be faster than the pseudocode shown, but will still only search for one pattern at a time. So the outermost `foreach (patterns) {}` will still be around.
earl
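
For concreteness, a minimal sketch of the Boyer-Moore-Horspool variant (bad-character rule only), assuming the bits are stored one per char in a std::string; names are illustrative:

#include <iostream>
#include <string>

// Boyer-Moore-Horspool sketch: single pattern, bad-character rule only.
int horspool(const std::string& text, const std::string& pattern) {
    size_t n = text.size(), m = pattern.size();
    if (m == 0 || n < m) return -1;

    // shift[c]: how far the window may slide when its last char is c.
    size_t shift[256];
    for (size_t c = 0; c < 256; ++c) shift[c] = m;
    for (size_t i = 0; i + 1 < m; ++i)
        shift[(unsigned char)pattern[i]] = m - 1 - i;

    for (size_t i = 0; i + m <= n; i += shift[(unsigned char)text[i + m - 1]]) {
        size_t j = m;
        while (j > 0 && text[i + j - 1] == pattern[j - 1]) --j;
        if (j == 0) return (int)i;  // full right-to-left match
    }
    return -1;
}

int main() {
    std::cout << horspool("110100100010111", "10100100010") << "\n";  // prints 1
}

Note that on a two-symbol alphabet the shift table holds only two distinct values, so the skips are modest; this limitation comes up again in the last answer below.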
+1  A: 

You could build a suffix array and search it; the query time is crazy good: O(length(pattern)). BUT you have to build that array first, so it's only worth it when the text is static and the patterns are dynamic.

n00ki3
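
To make the trade-off concrete, a naive suffix-array sketch (hypothetical names; the O(n^2 log n) construction is for brevity, while real builders run in O(n log n) or better). The binary search returns some occurrence, not necessarily the leftmost, so reporting the leftmost would need a scan of the matching range:

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Naive suffix-array construction: sort all suffix start positions.
std::vector<int> build_suffix_array(const std::string& text) {
    std::vector<int> sa(text.size());
    for (size_t i = 0; i < sa.size(); ++i) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Binary search for a suffix that starts with `pattern`.
// Returns the start index of one occurrence, or -1.
int find(const std::string& text, const std::vector<int>& sa,
         const std::string& pattern) {
    int lo = 0, hi = (int)sa.size();
    while (lo < hi) {  // lower bound over suffix prefixes
        int mid = (lo + hi) / 2;
        if (text.compare(sa[mid], pattern.size(), pattern) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < (int)sa.size() &&
        text.compare(sa[lo], pattern.size(), pattern) == 0)
        return sa[lo];
    return -1;
}

int main() {
    std::string text = "0001101101010100100";
    std::vector<int> sa = build_suffix_array(text);
    std::cout << find(text, sa, "0110") << "\n";  // prints 5 (index 2 matches too)
}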
A: 

If your strings need to be flexible, I would also recommend a modified Boyer-Moore string search algorithm, as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The Boyer-Moore model is incredibly efficient for searching a large amount of data for any of several strings to match against.

Jacob

TheJacobTaylor
What do you mean by "flexible"?
erjiang
Sorry about that, I was tired. I was essentially thinking of the Turbo Boyer-Moore algorithm (http://www-igm.univ-mlv.fr/~lecroq/string/node15.html) but could not remember the name of it. The nice thing about BM is that it pre-processes all of the strings that are being searched for, and runs through the main string one time to search for any match for any of the strings. That is why it does not pre-process the main string. It makes only one pass through. Granted, base BM can have 3n comparisons (where n is the size of the string being searched). Turbo BM is limited to 2n.
TheJacobTaylor
+6  A: 

The key for Googling is "multi-pattern" string matching.

Back in 1975, Aho and Corasick published a linear-time algorithm that was used in the original version of fgrep. The algorithm was subsequently refined by many researchers. For example, Commentz-Walter (1979) combined Aho-Corasick with Boyer-Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.

An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.

I suggest starting with the Aho-Corasick and Rabin-Karp Wikipedia pages and then deciding whether either would make sense in your case. If so, maybe there is already an implementation available for your language/runtime.

earl
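
To illustrate the Aho-Corasick idea, here is a minimal sketch for a binary alphabet (C++17, hypothetical names): build a trie of the patterns, add failure links, then scan the text once. It reports the match that ends earliest, which with mixed pattern lengths is not always the one that starts earliest:

#include <iostream>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct AhoCorasick {
    struct Node {
        int next[2] = {-1, -1};
        int fail = 0;
        int match = -1;  // index of a pattern ending here, if any
    };
    std::vector<Node> nodes{1};
    std::vector<size_t> pattern_len;

    void add(const std::string& pattern) {  // insert one pattern into the trie
        int v = 0;
        for (char c : pattern) {
            int b = c - '0';
            if (nodes[v].next[b] < 0) {
                nodes[v].next[b] = (int)nodes.size();
                nodes.emplace_back();
            }
            v = nodes[v].next[b];
        }
        nodes[v].match = (int)pattern_len.size();
        pattern_len.push_back(pattern.size());
    }

    void build() {  // BFS fills failure links and missing transitions
        std::queue<int> q;
        for (int b = 0; b < 2; ++b) {
            int u = nodes[0].next[b];
            if (u < 0) nodes[0].next[b] = 0;
            else { nodes[u].fail = 0; q.push(u); }
        }
        while (!q.empty()) {
            int v = q.front(); q.pop();
            if (nodes[v].match < 0 && nodes[nodes[v].fail].match >= 0)
                nodes[v].match = nodes[nodes[v].fail].match;  // suffix match
            for (int b = 0; b < 2; ++b) {
                int u = nodes[v].next[b];
                if (u < 0) nodes[v].next[b] = nodes[nodes[v].fail].next[b];
                else { nodes[u].fail = nodes[nodes[v].fail].next[b]; q.push(u); }
            }
        }
    }

    // One pass over the text; returns {start position, pattern index} or {-1, -1}.
    std::pair<int, int> find_first(const std::string& text) {
        int v = 0;
        for (size_t i = 0; i < text.size(); ++i) {
            v = nodes[v].next[text[i] - '0'];
            if (nodes[v].match >= 0) {
                int p = nodes[v].match;
                return {(int)(i + 1 - pattern_len[p]), p};
            }
        }
        return {-1, -1};
    }
};

int main() {
    AhoCorasick ac;
    for (const std::string& p : {"0001101101", "01010100100", "10100100010"})
        ac.add(p);
    ac.build();
    auto [pos, which] = ac.find_first("1110100100010");
    std::cout << pos << " " << which << "\n";  // prints 2 2
}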
I personally recommend Rabin-Karp, as it's relatively good at multi-pattern searches as well as being dirt simple to implement. The only remotely challenging part is the rolling hash, which you can read up on here (http://en.wikipedia.org/wiki/Rolling_hash)
Falaina
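
And a minimal multi-pattern Rabin-Karp sketch along those lines (C++17, hypothetical names; it assumes all patterns share one length, so with mixed lengths you would keep one rolling hash per length):

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One rolling hash slides across the text and is checked against the
// precomputed pattern hashes; a hash hit is verified char-by-char to
// rule out collisions.
std::pair<int, int> rabin_karp(const std::string& text,
                               const std::vector<std::string>& patterns) {
    const uint64_t base = 257, mod = 1000000007ULL;
    const size_t m = patterns[0].size();
    if (text.size() < m) return {-1, -1};

    uint64_t high = 1;  // base^(m-1), for removing the outgoing char
    for (size_t i = 1; i < m; ++i) high = (high * base) % mod;

    auto hash_of = [&](const std::string& s, size_t from) {
        uint64_t h = 0;
        for (size_t i = 0; i < m; ++i)
            h = (h * base + (unsigned char)s[from + i]) % mod;
        return h;
    };

    // Pattern hash -> pattern index (colliding pattern hashes would need
    // a multimap; this sketch ignores that case).
    std::unordered_map<uint64_t, int> pattern_hashes;
    for (int j = 0; j < (int)patterns.size(); ++j)
        pattern_hashes[hash_of(patterns[j], 0)] = j;

    uint64_t h = hash_of(text, 0);
    for (size_t i = 0; ; ++i) {
        auto it = pattern_hashes.find(h);
        if (it != pattern_hashes.end() &&
            text.compare(i, m, patterns[it->second]) == 0)
            return {(int)i, it->second};  // verified match
        if (i + m >= text.size()) break;
        h = (h + mod - (high * (unsigned char)text[i]) % mod) % mod;  // drop text[i]
        h = (h * base + (unsigned char)text[i + m]) % mod;            // add text[i+m]
    }
    return {-1, -1};
}

int main() {
    std::vector<std::string> patterns = {"0001101101", "0100100010"};
    auto [pos, which] = rabin_karp("110100100010111", patterns);
    std::cout << pos << " " << which << "\n";  // prints 2 1
}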
A: 

If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.

The code looks like this:

# psyco JIT-compiles Python 2 code at run-time; skip it if it's unavailable.
try:
    import psyco
    psyco.full()
except ImportError:
    pass

def encode(s):
    """Run-length encode a 0/1 string: a run of n '0's -> -n, of n '1's -> n."""
    def calc_count(count, c):
        return count * (-1 if c == '0' else 1)
    result = []
    c = s[0]
    count = 1
    for i in range(1, len(s)):
        d = s[i]
        if d == c:
            count += 1
        else:
            result.append(calc_count(count, c))
            count = 1
            c = d
    result.append(calc_count(count, c))
    return result

def search(encoded_source, targets):
    """Return (position, encoded target, target index) of the earliest match."""
    def match(encoded_source, t, max_search_len, len_source):
        # Interior runs must match exactly; the source's boundary runs need
        # only be at least as long as the target's boundary runs.
        x = len(t)-1
        # Get the indexes of the longest segments and search them first
        most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
        # Align the signs of the source and target
        index = (0 if encoded_source[0] * t[0] > 0 else 1)
        unencoded_pos = sum(abs(c) for c in encoded_source[:index])
        start_t, end_t = abs(t[0]), abs(t[x])
        for i in range(index, len(encoded_source)-x, 2):
            if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                if start_t <= encoded_start and end_t <= encoded_end:
                    return unencoded_pos + (abs(encoded_source[i]) - start_t)
            unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
            if unencoded_pos > max_search_len:
                return len_source
        return len_source
    len_source = sum(abs(c) for c in encoded_source)
    i, found, target_index = len_source, None, -1
    for j, t in enumerate(targets):
        x = match(encoded_source, t, i, len_source)
        print "Match at: ", x
        if x < i:
            i, found, target_index = x, t, j
    return (i, found, target_index)


if __name__ == "__main__":
    import datetime
    def make_source_text(length):
        from random import randint
        item_len = 8
        item_count = 2**item_len
        table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
        return "".join(table[randint(0, item_count - 1)] for _ in range(length // item_len))
    targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
    encoded_targets = [encode(t) for t in targets]
    data_len = 10*1000*1000
    s = datetime.datetime.now()
    source_text = make_source_text(data_len)
    e = datetime.datetime.now()
    print "Make source text(length %d): " % data_len,  (e - s)
    s = datetime.datetime.now()
    encoded_source = encode(source_text)
    e = datetime.datetime.now()
    print "Encode source text: ", (e - s)

    s = datetime.datetime.now()
    (i, found, target_index) = search(encoded_source, encoded_targets)
    print (i, found, target_index)
    print "Target was: ", targets[target_index]
    print "Source matched here: ", source_text[i:i+len(targets[target_index])]
    e = datetime.datetime.now()
    print "Search time: ", (e - s)

On a string of 10 million characters (a bit longer than your 8 million), it takes about seven seconds to find the earliest match of the three targets. Of course, since I am using random text, that varies a bit with each run.

psyco is a Python module that optimizes code at run-time. With it, you get good performance, and you can treat the timings as a rough upper bound on the time C/C++ would need. Here is recent performance:

Make source text(length 10000000):  0:00:02.277000
Encode source text:  0:00:00.329000
Match at:  2517905
Match at:  494990
Match at:  450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was:  1010010001010100100010
Source matched here:  1010010001010100100010
Search time:  0:00:04.325000

It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.

hughdbrown
I'm concerned that the time it takes to encode 8 million bits into runs, do a few searches, and then decode back into raw bits might exceed the time for BM or KMP searches on the raw bits
erjiang
That's certainly a possibility.
hughdbrown
+1  A: 

A solution that could be efficient:

  1. store the patterns in a trie data structure
  2. start searching the list
  3. walk the trie with the next chars of the list and stop on reaching a stored pattern (at most pattern_length steps, i.e. constant with respect to the list length); see the sketch below
  4. step one char and repeat #3

If the list isn't mutable, you can store the offsets of matching patterns to avoid repeating the calculations the next time.

Nick D
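
A minimal sketch of that trie walk (hypothetical names; bits stored as '0'/'1' chars):

#include <iostream>
#include <memory>
#include <string>

// Minimal binary trie over the patterns: two children per node (bit 0 /
// bit 1) plus a flag marking the end of a stored pattern.
struct TrieNode {
    std::unique_ptr<TrieNode> child[2];
    bool is_pattern_end = false;
};

void insert(TrieNode* root, const std::string& pattern) {
    TrieNode* node = root;
    for (char c : pattern) {
        int b = c - '0';
        if (!node->child[b]) node->child[b] = std::make_unique<TrieNode>();
        node = node->child[b].get();
    }
    node->is_pattern_end = true;
}

// Slide over the bit string; at each offset, walk the trie until a stored
// pattern ends (match) or a child is missing (mismatch).
int find_first(const std::string& bits, const TrieNode* root) {
    for (size_t i = 0; i < bits.size(); ++i) {
        const TrieNode* node = root;
        for (size_t j = i; j < bits.size(); ++j) {
            node = node->child[bits[j] - '0'].get();
            if (!node) break;                         // mismatch: next offset
            if (node->is_pattern_end) return (int)i;  // a pattern matched
        }
    }
    return -1;
}

int main() {
    TrieNode root;
    for (const std::string& p : {"0001101101", "01010100100", "10100100010"})
        insert(&root, p);
    std::cout << find_first("111000110110100", &root) << "\n";  // prints 3
}

Aho-Corasick, discussed above, extends exactly this trie with failure links, so the per-offset re-walk disappears and the list is scanned once.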
A: 

If it's a bit array, I suppose doing a rolling sum would be an improvement. If the pattern is length n, sum the first n bits and see if the sum matches the pattern's sum. Then, for every next position, subtract the outgoing first bit of the window from the sum and add the incoming bit, and compare again. A matching sum is only a candidate, since different windows can share a bit count, so you still verify the actual bits on a hit; but a mismatching sum lets you skip the linear loop over the pattern.

It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first (bad-character) table doesn't help a whole lot. The second (good-suffix) table might help, but that means BMH, which drops it, is mostly worthless.

Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy), and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.

erjiang
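
For illustration, a minimal sketch of such a rolling-sum filter, assuming byte-per-bit storage (names are hypothetical; a single pattern for brevity, so for several patterns you would run one sum per pattern length):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// The window's bit count is a weak hash: when it differs from the pattern's
// count the inner loop is skipped, and when it agrees the window is
// verified bit-by-bit.
int find_with_rolling_sum(const std::vector<uint8_t>& bits,
                          const std::vector<uint8_t>& pattern) {
    const size_t n = bits.size(), m = pattern.size();
    if (m == 0 || n < m) return -1;
    const int target = std::accumulate(pattern.begin(), pattern.end(), 0);
    int sum = std::accumulate(bits.begin(), bits.begin() + m, 0);
    for (size_t i = 0; ; ++i) {
        if (sum == target &&
            std::equal(pattern.begin(), pattern.end(), bits.begin() + i))
            return (int)i;                 // verified match
        if (i + m >= n) break;
        sum += bits[i + m] - bits[i];      // add incoming bit, drop outgoing
    }
    return -1;
}

int main() {
    std::vector<uint8_t> bits = {1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1};
    std::vector<uint8_t> pattern = {0, 0, 1, 1, 0};
    std::cout << find_with_rolling_sum(bits, pattern) << "\n";  // prints 3
}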