ansaurus

Question

First-Occurrence Parallel String Matching Algorithm

Answer 1

+3 A:

Given a pattern of length L and a string of length N to search over P processors, I would just split the string across the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 characters overlapping the chunk belonging to the next processor, so that a match straddling a boundary is not missed. Each processor would then run Boyer-Moore (the two pre-processing tables would be shared). When each one finishes, it returns its result to the first processor, which maintains a table:

Process  Index
   1       -1
   2        2
   3       23

Once all processes have responded (or earlier, if you put a bit of thought into an early escape), you return the first match, i.e. the index from the lowest-numbered process that did not report -1. On average this should take O(N/(L*P) + P) time.
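A minimal sketch of the split-with-overlap idea, in Python. It makes several assumptions that are not in the answer: `str.find` stands in for Boyer-Moore, a thread pool stands in for the P processors (in CPython this shows the structure only, not a real speed-up), and the names `search_chunk` and `first_occurrence` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil


def search_chunk(text, pattern, start, end):
    """Return the absolute index of the first match that starts in [start, end), or -1."""
    # Extend the window by len(pattern) - 1 so a match straddling the
    # boundary with the next chunk is still seen by this worker.
    stop = min(len(text), end + len(pattern) - 1)
    return text.find(pattern, start, stop)  # stand-in for Boyer-Moore


def first_occurrence(text, pattern, workers=4):
    if not pattern:
        return 0
    if len(pattern) > len(text):
        return -1
    size = ceil(len(text) / workers)
    bounds = [(s, min(s + size, len(text))) for s in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # results[i] plays the role of the Process/Index table above;
        # pool.map preserves chunk order.
        results = list(pool.map(lambda b: search_chunk(text, pattern, *b), bounds))
    # The first entry that is not -1 is the first occurrence overall.
    return next((r for r in results if r != -1), -1)
```

Because each worker only reports matches that start inside its own chunk, the per-chunk results are already ordered, so scanning the table from the top is enough.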

The approach of having the i-th processor match the i-th character would require too much interprocess communication overhead.

EDIT: I realize you already have a solution and are looking for a way to avoid finding all matches. Well, I don't really think that is necessary. You can come up with some early escape conditions, and they aren't that difficult, but I don't think they'll improve your performance much in general (unless you have some additional knowledge about the distribution of matches in your text).

Il-Bhima 2010-02-22 22:18:21

Thanks for the pointer :) this did end up working out to be far far more efficient than any ideas I had.

Xorlev 2010-02-23 01:41:57

Answer 2

+1 A:

I am afraid that breaking the string will not do.

Generally speaking, early escaping is difficult, so you'd be better off breaking the text into chunks.

But first, let's have Herb Sutter explain searching with parallel algorithms on Dr. Dobb's. The idea is to use the non-uniformity of the distribution to get an early return. Of course, Sutter is interested in any match, which is not the problem at hand, so let's adapt.

Here is my idea, let's say we have:

Text of length N
p Processors
heuristic: max is the maximum number of characters a chunk should contain, probably an order of magnitude greater than M the length of the pattern.

Now, what you want is to split your text into k equal chunks, where k is minimal and size(chunk) is maximal yet no greater than max.
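The chunk sizing can be sketched in a few lines of Python. This assumes "inferior to max" means "at most max" and that the last chunk may be slightly shorter when N is not divisible by k; the helper name `plan_chunks` is hypothetical.

```python
from math import ceil


def plan_chunks(n, max_size):
    """Split [0, n) into the fewest near-equal chunks of at most max_size characters."""
    if n == 0:
        return []
    k = ceil(n / max_size)  # smallest chunk count that respects the cap
    size = ceil(n / k)      # largest near-equal size; never exceeds max_size
    return [(start, min(start + size, n)) for start in range(0, n, size)]
```

For example, `plan_chunks(10, 4)` gives `[(0, 4), (4, 8), (8, 10)]`. When searching, each chunk would still need to read M-1 characters past its end so that a match straddling two chunks is not lost.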

Then, we have a classical Producer-Consumer pattern: the p processes are fed the chunks of text, each process looking for the pattern in the chunk it receives.

The early escape is done with a flag. You can either store the index of the chunk in which you found the pattern (and its position), or you can just set a boolean and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stopped). The point is that each time a chunk is requested, the producer checks the flag and stops feeding the processes if a match has been found (since the processes have been given the chunks in order).

Let's have an example, with 3 processors:

[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
                      x       x

The chunks 6 and 8 both contain the string.

The producer will first feed 1, 2 and 3 to the processes, then each process will advance at its own rhythm (it depends on the similarity of the text searched and the pattern).

Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 finishes and tries to get another chunk, but the producer stops it, since any later chunk would be irrelevant. Then the process working on 6 finishes with a result, and thus we know that the first occurrence was in 6, and we have its position.
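The scenario above can be sketched in Python as follows. This is an illustration under stated assumptions, not the answer's own code: threads stand in for processes (so in CPython it shows the control flow rather than a real speed-up), `str.find` stands in for the actual matcher, each chunk reads M-1 extra characters to catch matches on a boundary, and the pattern is assumed non-empty with `chunk_size` at least 1. All names are hypothetical.

```python
import threading


def first_occurrence_early_exit(text, pattern, chunk_size, workers=3):
    n, m = len(text), len(pattern)
    lock = threading.Lock()
    state = {"next": 0, "found": False}
    hits = []  # absolute match positions reported by the workers

    def next_chunk():
        # The "producer": hands out chunks in order and stops feeding
        # as soon as any worker has flagged a match.
        with lock:
            start = state["next"]
            if state["found"] or start >= n:
                return None
            state["next"] = start + chunk_size
            return start

    def worker():
        while (start := next_chunk()) is not None:
            # Overlap by m - 1 so a match straddling two chunks is found.
            stop = min(n, start + chunk_size + m - 1)
            pos = text.find(pattern, start, stop)
            if pos != -1:
                with lock:
                    state["found"] = True
                    hits.append(pos)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Chunks are handed out in order, so every chunk before a hit was
    # already in flight and has finished by now: the minimum is the first match.
    return min(hits, default=-1)
```

Workers always finish the chunk they hold before asking for another one, which is what lets the process on chunk 6 still report its result after chunk 8 has set the flag.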

The key idea is that you don't want to look at the whole text! It's wasteful!

Matthieu M. 2010-02-26 13:25:32

+1 Awesome answer. The assignment has long since been turned in but I love to see how this could work. I tend to obsess over fun and interesting problems for weeks. :) I do hope others find this answer also useful and uprate for it as it's one of the clearest answers I've seen.

Xorlev 2010-02-26 18:33:58
