Automatic regex builder

views:

656

answers:

+2 Q:

Automatic regex builder

I have N strings. Also, there are K regular expressions, unknown to me. Each string is either matching one of the regular expressions, or it is garbage. There are total of L garbage strings in the set. Both K and L are unknown.

I'd like to deduce the regular expressions. Obviously, this problem has infinite number of solutions. I need to find a "reasonably good solution", which

1) minimizes K

2) minimizes L

3) maximizes "specifics" of the regular expressions. I don't know what't the right term for this quality. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific".

All 3 requirements need to be taken as one compound criteria, with certain reasonable weights.

A solution for one particular case: If L = 0 and K = 1 (just one regex, and no garbage), then we can just find LCS (longest common subsequence) for the strings and come up with a corresponding regex from there. However, when we have "noise" (L > 0), this approach doesn't work.

Any ideas (or pointers to existing work) are greatly appreciated.

Nothing clever here, perhaps I don't fully understand the problem?

Why not just always reduce L to 0? Check each string against each regex; if a string doesn't match any of the regex's, it's garbage. if it does match, remember the regex/string(s) that did match and do LCS on each L = 0, K = 1 to deduce each regex's definition.

benson 2009-05-21 22:46:11

I don't have any regexes to start with. The problem to deduce them.

Igor Krivokon 2009-05-21 22:57:20

+1 A:

The key words in academia are "grammatical inference". Unfortunately, there aren't any efficient, general algorithms to do the sort of thing you're proposing. What's your real problem?

Edit: it sounds like you might be interested in Data Description Languages. PADS (http://www.padsproj.org/) is a typical example.

Dave 2009-05-21 22:55:20

> What's your real problem?As a hobby project, I'm implementing a "magic editor" for big files, mostly data (plus occasional comments or "irregularities"). Quite often I need to change formatting, or remove a column of values, or something like this. Usually I do this kind of stuff with a quick perl one-liner. However, I wanted to create a more "visual" solution for people not familiar with regexes. They would just edit one line, and the other (similar) lines in the file change automatically.

Igor Krivokon 2009-05-21 23:16:51

Will check PADS, thanks!

Igor Krivokon 2009-05-21 23:52:59

+1 A:

What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.

I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all n strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).

Two interesting queries at Google Scholar:

Stephan202 2009-05-21 22:56:09

Thanks! Will check papers on MDL.

Igor Krivokon 2009-05-21 23:00:56

ansaurus

tags:

views:

answers:

Automatic regex builder

related questions