views:

128

answers:

3

I'm working on a fuzzy search implementation and as part of the implementation, we're using Apache's StringUtils.getLevenshteinDistance. At the moment, we're going for a specific maxmimum average response time for our fuzzy search. After various enhancements and with some profiling, the place where the most time is spent is calculating the Levenshtein distance. It takes up roughly 80-90% of the total time on search strings three letters or more.

Now, I know there are some limitations to what can be done here, but I've read on previous SO questions and on the Wikipedia link for LD that if one is willing limit the threshold to a set maximum distance, that could help curb the time spent on the algorithm, but I'm not sure how to do this exactly.

If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]

Below you will see the original LH code from StringUtils. After that is my modification. I'm trying to basically calculate the distances of a set length from the i,j diagonal (so, in my example, two diagonals above and below the i,j diagonal). However, this can't be correct as I've done it. For example, on the highest diagonal, it's always going to choose the cell value directly above, which will be 0. If anyone could show me how to make this functional as I've described, or some general advice on how to make it so, it would be greatly appreciated.

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

My modifications (only to the for loops):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }
+3  A: 

I've written about Levenshtein automata, which are one way to do this sort of check in O(n) time before, here. The source code samples are in Python, but the explanations should be helpful, and the referenced papers provide more details.

Nick Johnson
Great article, Nick!
Bart Kiers
That does seem like it would be useful, but I'm currently just trying to see what difference having just a threshold would make, as I'm not sure how much time I will have on this.
AHungerArtist
Plus, we're pretty close to reaching our desired mark, so a somewhat small change would be nicer than a large one.
AHungerArtist
As you said in the original question, if you have a threshold, it'll take O(n) time instead of O(mn) time. Modifying the dynamic programming procedure may well be simpler in your case, but I'm not sure how you'd go about that.
Nick Johnson
+1  A: 

Here someone answers a very similar question:

Cite:
I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.

char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */ 
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}  

just the main part, more code in the original

belisarius
A: 

I used the original code and places this just before the end of the j for loop:

    if (p[n] > s.length() + 5)
        break;

The +5 is arbitrary but for our purposes, if the distances is the query length plus five (or whatever number we settle upon), it doesn't really matter what is returned because we consider the match as simply being too different. It does cut down on things a bit. Still, pretty sure this isn't the idea that the Wiki statement was talking about, if anyone understands that better.

AHungerArtist