views:

283

answers:

1

Reciently I've looked through several implementation of bitap algorithm but what all of them do is finding the beginning point of fuzzy match. What I need is to find a match. There's an example:

Say we have following text: abcdefg

and a pattern: bzde

and we want to find all occurence of a pattern in text with at most 1 error (Edit distance is concidered).

So I need that the algorithm returns: bcde.

Is there a simple (or not simple =) ) way to do it? The original artical about this algorithm doens't answer the question.

Thank you for your help.

+1  A: 

For a simple start you could approach it with a series of regular expressions where in every expression you replace 1 character with the . wildcard. Combining those expressions into one using the ( | ) construct to create one big regular expression.

Another way would be to scan the string keeping an errorcount and incrementing the offset you match at when too many errors are encountered.

rsp
Thank you for your answer.1. I don't think that the approach with regular expression will be useful. Let's consider a case when we have a deleting or insertion as an edit operation. In this case we will need not a '.' in our regular expression but something like '.{0,2}' and we'll face several more problems especially when count of errors we allow is greater than 1.2. As far as I understood we should look at the difference between matrices d_k and d_(k+1) and we should find out what an edit operatioin leads from d_k to d_(k+1) (insertion, deleting,so on). Is it right?
StuffHappens
The bitap method gives the start of the match, finding the end is easier once you know a match starts at that index. You could compare match and searchstring until the number of error is too big or the searchstring ends. If you want to derive the end index from the bitap algorithm itself, look at the point where it decides to return a match and what index it inspected last.
rsp
Thanks again...
StuffHappens