views:

437

answers:

5

I was reading the Java project idea described here:

The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.

This sounds like a really interesting idea to me. Does anyone has an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.

A: 

You may try to use a basic inferring algorithm that has been used in other applications. I have implemented a very basic based on building a state machine. However, it only accounts for positive samples. The source code is on http://github.com/mvaled/inferdtd

Should could be interested in the AutomataInferrer.py which is very simple.

manu
A: 

RegexBuilder seems to have many of the features you're looking for.

Joe
It's not the same at all. Also, I haven't asked for a program, I asked for algorithm ideas.
Amir Rachum
Amir, but if you take the time to evaluate the program it could give you some ideas for the algorithm!
splash
+4  A: 

Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Aho Dragon compiler book uses a regex example to build a DFA compiler. That's the place to start. This could be a really cool compiler project.

If that is too much, you can approach it as an optimization in several passes to refine it further and further, but it will be all predefined algo's at first:

First pass: Want to match Cat, Catches cans Result: /Cat|Catches|Cans/

Second Pass: Look for similar starting conditions: Result: /Ca(t|tches|ans)/

Second Pass: Look for similar ending conditions: Result: /Ca(t|tch|an)s*/

Third Pass: Look for more refinements like repetitions and negative conditions

SDGator
I have thought of this approach, but it has several drawbacks: first of all, I don't see how it works for unmatched examples. Secondly, it will never come up with what I really expect. for example, if I get several examples like `a`, `aa`, `aaa`, `aaaa`, `aaaaa`, then I want it to come up with `a*` and not `aa?a?a?a?`. i.e, the shortest regex that matches/unmatches the examples provided.
Amir Rachum
Sounds like you want to make a compiler, then. Read the first two or three chapters of the Aho book. It details the building a DFA compiler to do this. We did this at my last company...a friend of mine wrote a compiler for a regex engine we implemented in HW. He also used a visualization library to diagram the DFA states the compiler came up with, and it sometimes came up with some really wild looking diagrams. We were optimizing for performance rather than shortness, though.
SDGator
+1  A: 

The program tries to deduce a regex that fits the examples

I don't think it's a useful question to ask. You have to know semantically what you need to represent to deduce something. When you write a regex, you have a purpose: accepting urls, accepting emails, extracting tokens from code, etc. I would redefine the question as so: given a knowledge base and a semantic for regular expression, compute the smallest regex. This get a step further, because you have natural language trying explaining a general expression and we all know how it get ambiguous! You have to have some semantic explanation. Without that, the best thing you can do for examples is to compute regex that cover all string from the ok list.

Algorithm for coverage:

Populate Ok List
Populate Not ok List
Check for repetitions
Check for contradictions ( the same string cannot be in both list )
Create Deterministic Finite Automata (DFA) from Ok List where strings from the list are final states
Simplify the DFA by eliminating repetitive states. ([1] 4.4 )
Convert DFA to regular expression. ([1] 3.2.2 )
Test Ok list and Not ok list


[1] Introduction to Automata Theory, Language, and Computation. J. Hopcroft, R. Motwani, J.D. Ullman, 2nd Edition, Pearson Education.

P.S.

I had some experience with genetic programming and I think it's really overhead for your problem. Since the objective function is really light it's better to evaluate with a single processor and this can take a lot of time. To have shorter expression you just need to minimize the DFA. But GA may possibly produce interesting result.

Guillaume Massé
+3  A: 

There exists algorithm that does exactly this for positive examples.

Regular expression are equivalent to DFA (Deterministic Finite Automata). The strategie is mostly always the same:

Look at Alergia (for the theory) and MDI algorithm (for real usage) if generate an Deterministic Automaton is enough.

The Alergia algorithm and MDI are both described here: http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf

If you want to generate smaller models you can use another algorithm. The article describing it is here: http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz

His homepage is here: http://www.grappa.univ-lille3.fr/~lemay

If you want to use negative example, I suggest you to use a simple rule (coloring) that prevent two states of the DFA to be merged.

If you ask these people, I am sure they will share their code source.

I made the same kind of algorithm during my Ph.D. for probabilistic automata. That means, you can associate a probability to each string, and I have made a C++ program that "learn" Weighted Automata.

Mainly these algorithm work like that:

from positive examples: {abb, aba, abbb}

create the simplest DFA that accept exactly all these examples:

-> x -- a --> (y) -- b --> (z)
          \
            b --> t -- b --> (v)

x cant got to state y by reading the letter 'a' for example. The states are x, y, z, t and v. (z) means it is a finite state.

then "merge" states of the DFA: (here for example the result after merging states y and t.

               _
              /  \
             v    | a,b     ( <- this is a loop :-) )
x -- a -> (y,t) _/

the new state (y,t) is a terminal state obtaining by merging y and t. And you can read the letter a and b from it. Now the DFA can accept: a(a|b)* and it is easy to construct the regular expression from the DFA.

Which states to merge is a choice that makes the main difference between algorithms.

yogsototh
How does the algorithm you described deal with negative examples?
Amir Rachum
In fact, I mostly worked with positive examples. For negative one, you should use a method of state coloring that prevent some merge. The algorithm is detailled here: http://www.irisa.fr/symbiose/old/people/coste/pub/icml97.ps
yogsototh