views:

656

answers:

3

I have N strings. Also, there are K regular expressions, unknown to me. Each string is either matching one of the regular expressions, or it is garbage. There are total of L garbage strings in the set. Both K and L are unknown.

I'd like to deduce the regular expressions. Obviously, this problem has infinite number of solutions. I need to find a "reasonably good solution", which

1) minimizes K

2) minimizes L

3) maximizes "specifics" of the regular expressions. I don't know what't the right term for this quality. For example, the string "ab123" can be described as /ab\d+/ or /\w+.+/, but the first regex is more "specific".

All 3 requirements need to be taken as one compound criteria, with certain reasonable weights.

A solution for one particular case: If L = 0 and K = 1 (just one regex, and no garbage), then we can just find LCS (longest common subsequence) for the strings and come up with a corresponding regex from there. However, when we have "noise" (L > 0), this approach doesn't work.

Any ideas (or pointers to existing work) are greatly appreciated.

A: 

Nothing clever here, perhaps I don't fully understand the problem?

Why not just always reduce L to 0? Check each string against each regex; if a string doesn't match any of the regex's, it's garbage. if it does match, remember the regex/string(s) that did match and do LCS on each L = 0, K = 1 to deduce each regex's definition.

benson
I don't have any regexes to start with. The problem to deduce them.
Igor Krivokon
+1  A: 

The key words in academia are "grammatical inference". Unfortunately, there aren't any efficient, general algorithms to do the sort of thing you're proposing. What's your real problem?

Edit: it sounds like you might be interested in Data Description Languages. PADS (http://www.padsproj.org/) is a typical example.

Dave
> What's your real problem?As a hobby project, I'm implementing a "magic editor" for big files, mostly data (plus occasional comments or "irregularities"). Quite often I need to change formatting, or remove a column of values, or something like this. Usually I do this kind of stuff with a quick perl one-liner. However, I wanted to create a more "visual" solution for people not familiar with regexes. They would just edit one line, and the other (similar) lines in the file change automatically.
Igor Krivokon
Will check PADS, thanks!
Igor Krivokon
+1  A: 

What you are trying to do is language learning or language inference with a twist: instead of generalising over a set of given examples (and possibly counter-examples), you wish to infer a language with a small yet specific grammar.

I'm not sure how much research is being done on that. However, if you are also interested in finding the minimal (= general) regular expression that accepts all n strings, search for papers on MDL (Minimum Description Length) and FSMs (Finite State Machines).

Two interesting queries at Google Scholar:

Stephan202
Thanks! Will check papers on MDL.
Igor Krivokon