views:

63

answers:

5

Good afternoon,

Does anyone know of an "out-of-the-box" implementation of Levenshtein DFA (deterministic finite automata) in .NET (or easily translatable to it)? I have a very big dictionary with more than 160000 different words, and I want to, given an inicial word w, find all known words at Levenshtein distance at most 2 of w in an efficient way.

Of course, having a function which computes all possible edits at edit distance one of a given word and applying it again to each of these edits solves the problem (and in a pretty straightforwad way). The problem is effiency --- given a 7 letter word, this can already take over 1 second to complete, and I need something much more efficient --- if possible, as it is with Levenshtein DFAs, a solution that takes O(|w|) steps.

Edit: I know I can construct my own approach to the problem with a little bit of studying, but at the moment I can't afford reading Schulz and Mihov's 60-page-long articles.

Thank you very much.

A: 

I JUST GOOGLED AND FOUND THIS LINK

EDIT

For C# implementation

See Here

saurabh
Thank you so much, but that's not what I am looking for. I already wrote myself the code for that, but I need the DFA approach.
Miguel
+1  A: 

Here you go.

/// <summary>
/// Levenshtein Distance Calculator
/// </summary>
public static int DistanceFrom(this string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    // Step 1
    if (n == 0)
        return m;

    if (m == 0)
        return n;

    // Step 2
    for(int i = 0; i <= n; d[i, 0] = i++) ;
    for(int j = 0; j <= m; d[0, j] = j++) ;

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            // Step 6
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}
Nathan Ridley
Thanks but again I am not looking for a distance-calculation algorithm but rather for a DFA implementation of known word search.
Miguel
A: 

Sorry for being posting again, but I just would like to make clear that I am not looking for Levenshtein distance algorithms, but for a DFA (deterministic finite automata) implementation of similar word search based on a bounded Levenshtein distance.

Miguel
A: 

I understand you want to find near matches in a big dictionary. Here's the way I do it. link.

From what I'm able to figure out about DFA, I can't see how it's any better, or even actually any different, under the skin. NFAs might be faster, but that's because they don't exist. Maybe I'm wrong.

Mike Dunlavey
A: 

Hi Miguel.

We implemented this for apache lucene java, perhaps you could convert it to C# and save yourself time.

the main class is here: its just a builder to get Levenshtein DFAs from a string, using the Schulz and Mihov algorithm.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java

the parametric descriptions (the precomputed tables) for Lev1 and Lev2 are here: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java

you might notice these are generated with a computer, we generated them with this script, using Jean-Phillipe Barrette's great moman implementation (python) http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py

we generate the parametric descriptions as packed long[] arrays so that it won't make our jar file too large.

just modify the toAutomaton(int n) to fit your needs/DFA package. in our case we are using a modified form of the brics automaton package, where transitions are represented as unicode codepoint ranges.

efficient unit tests are difficult for this sort of thing, but here is what we came up with... it seems to be thorough and even found a bug (which was fixed immediately by the author!) in the moman implementation.

http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java

Robert Muir