tags:

views:

81

answers:

1

Is it possible ( or why not possible ) to convert input string to a string that match regex in least Levenshtein distance ?

i.e. if 1234 is string and ^([0-9]{6})$ is regex, i need output something like 123412 ( output string matches the regex and is 2 distance from original string, there may be other string but first result will do )

How to do this ? ( no brute force..)

edit:

in other possibilities, can I get Levenshtein distance only ? ( without matching string ... )

or what other information apart form boolean( match or not match ) can regex give ?

+3  A: 

If you know about finite automaton you could construct one which represents your regex (there are libraries to do so). Then you run it with your string (1234) and you would end up in some state. From this state you do a breath-first search until you reach a accept state. While you searching you keep track of which transitions (characters) you run over. And the characters will give you the shortest (or one of them) string which qualify your regex.

Added link: you may have a look on http://www.brics.dk/automaton/ which is a automaton library implemented at Aarhus University (BSD license)

Update: I have build what you seek with the automaton implementation from above. First, the ExtendedOperations class which is in the same package as the other automaton classes because I needed to access some methods.

package dk.brics.automaton;

public class ExtendedOperations {
    //Taken from Automaton.run and modified to just return state instead of accept (bool)
    static State endState(String s, Automaton a)
    {
        if (!a.deterministic) a.determinize();

        State p = a.initial;
        for (int i = 0; i < s.length(); i++) {
            p = p.step(s.charAt(i));
            if (q == null) return null;
        }
        return p;
    }

    public static String getShortestCompletion(Automaton a, String partlyInput)
    {
        State e = endState(partlyInput, a);

        if (e == null) return null;

        return BasicOperations.getShortestExample(e, true);
    }
}

Second, a little testsample:

package subsetautomaton;

import dk.brics.automaton.*;

public class Main {
    public static void main(String[] args) {
        RegExp re = new RegExp("[a-zA-Z0-9._%+-]+\\@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");
        Automaton a = re.toAutomaton();

        System.out.println(ExtendedOperations.getShortestCompletion(a, "a"));
    }
}

The example is a naive email address reg. exp. Notice that ^ is implicit in the reg. exp. and the same thing with $. Second, @ is escaped with \ because it means 'any string' in this implementation.

The result of the example above is: @-.AA

lasseespeholt