views:

430

answers:

3

In my work I have with great results used approximate string matching algorithms such as Damerau–Levenshtein distance to make my code less vulnerable to spelling mistakes.

Now I have a need to match strings against simple regular expressions such TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0 while T Schedule for 10. Jan should return 2.

This could be done by generating all strings in the regex (in this case 100x12) and find the best match, but that doesn't seam practical.

Do you have any ideas how to do this effectively?

+2  A: 

Have you considered using a lexer?

I've never actually used one so i can't be much help, but it sounds like it fits!

Paul Creasey
I think lexers are more for tokenizing than matching. If I start splitting my string, I won't be able to recognize characters moved from one token to another.
Thomas Ahle
You may have to define your problem as a lexing/parsing problem, rather than as a simple regular expression. Then you could use Levenshtein distance on the individual tokens.
Avi
I see. But the lexer link you've sent seams quite deterministic. What if instead of `TV Schedule for 10 Jan` I get `TV Schedule for Jan 10`? That should have a distance of 2, since two characters have been transposed. Maybe the lexer could indentify substrings looking like numbers or months, but then `TV Schedule forJan 10` or `TV Schedule for 10 Jan 2010` would give problems..
Thomas Ahle
+1  A: 

Here is a resource on the question you are asking. It is a bit of a teaser for a company. More useful might be this paper. I've seen an implementation inspired by the paper that could do a fuzzy search, biased for special language (e.g. Arabic vs. English), on a large dataset.

In general, you won't be able to do what you asked about. You can make a regexp search fuzzy by replacing characters with equivalence classes, or you can search a database for near-matches defined by Levenshtein distance. Trying to expand the (n)DFA behind a regexp to include near-matches by distance would rapidly become impossibly complex.

bmargulies
The first one seams to be on standard approximate string matching? The second one seams to be on fuzzy lookups in a dictionary. That could probably be used thinking of the regex as a 'fictionary dictionary'?
Thomas Ahle
+8  A: 

I found the TRE library, witch seams to be able to do exactly fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution though. No transposition. But I guess that works ok.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March

Thanks a lot for all your proposes.

Thomas Ahle
+1 for solving your own problem. If just everyone would do that … ;-)
Gumbo