ansaurus

Question

Answer 1

+2 A:

Have you considered using a lexer?

I've never actually used one so i can't be much help, but it sounds like it fits!

Paul Creasey 2010-02-28 16:18:47

I think lexers are more for tokenizing than matching. If I start splitting my string, I won't be able to recognize characters moved from one token to another.

Thomas Ahle 2010-02-28 17:31:05

You may have to define your problem as a lexing/parsing problem, rather than as a simple regular expression. Then you could use Levenshtein distance on the individual tokens.

Avi 2010-02-28 17:36:52

I see. But the lexer link you've sent seams quite deterministic. What if instead of `TV Schedule for 10 Jan` I get `TV Schedule for Jan 10`? That should have a distance of 2, since two characters have been transposed. Maybe the lexer could indentify substrings looking like numbers or months, but then `TV Schedule forJan 10` or `TV Schedule for 10 Jan 2010` would give problems..

Thomas Ahle 2010-03-04 10:15:51

Answer 2

+1 A:

Here is a resource on the question you are asking. It is a bit of a teaser for a company. More useful might be this paper. I've seen an implementation inspired by the paper that could do a fuzzy search, biased for special language (e.g. Arabic vs. English), on a large dataset.

In general, you won't be able to do what you asked about. You can make a regexp search fuzzy by replacing characters with equivalence classes, or you can search a database for near-matches defined by Levenshtein distance. Trying to expand the (n)DFA behind a regexp to include near-matches by distance would rapidly become impossibly complex.

bmargulies 2010-02-28 16:33:06

The first one seams to be on standard approximate string matching? The second one seams to be on fuzzy lookups in a dictionary. That could probably be used thinking of the regex as a 'fictionary dictionary'?

Thomas Ahle 2010-02-28 21:55:51

Answer 3

+8 A:

I found the TRE library, witch seams to be able to do exactly fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution though. No transposition. But I guess that works ok.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename
1:TV Schedule for 10Jan
8:TVSchedule for Jan 10
7:T Schedule for 10 Jan 2010
3:TV Schedule for 10 March
15:Tv plan for March

Thanks a lot for all your proposes.

Thomas Ahle 2010-03-04 10:54:24

+1 for solving your own problem. If just everyone would do that … ;-)

Gumbo 2010-03-04 10:59:13

ansaurus

tags:

views:

answers:

Fuzzy Regular Expressions

related questions