views:

368

answers:

1

I have a stored procedure that uses Levenshtein distance to determine the result closest to what the user typed. The only thing really affecting the speed is the function that calculates the Levenshtein distance for all the records before selecting the record with the lowest distance (I've verified this by putting a 0 in place of the call to the Levenshtein function). The table has 1.5 million records, so even the slightest adjustment may shave off a few seconds. Right now the entire thing runs over 10 minutes. Here's the method I'm using:

ALTER function dbo.Levenshtein
( 
    @Source nvarchar(200), 
    @Target nvarchar(200) 
) 
RETURNS int
AS
BEGIN
DECLARE @Source_len int, @Target_len int, @i int, @j int, @Source_char nchar, @Dist int, @Dist_temp int, @Distv0 varbinary(8000), @Distv1 varbinary(8000)

SELECT @Source_len = LEN(@Source), @Target_len = LEN(@Target), @Distv1 = 0x0000, @j = 1, @i = 1, @Dist = 0

WHILE @j <= @Target_len
BEGIN
    SELECT @Distv1 = @Distv1 + CAST(@j AS binary(2)), @j = @j + 1
END

WHILE @i <= @Source_len
BEGIN
    SELECT @Source_char = SUBSTRING(@Source, @i, 1), @Dist = @i, @Distv0 = CAST(@i AS binary(2)), @j = 1

WHILE @j <= @Target_len
BEGIN
    SET @Dist = @Dist + 1
    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j-1, 2) AS int) +
                  CASE WHEN @Source_char = SUBSTRING(@Target, @j, 1) THEN 0 ELSE 1 END

    IF @Dist > @Dist_temp
    BEGIN
        SET @Dist = @Dist_temp
    END

    SET @Dist_temp = CAST(SUBSTRING(@Distv1, @j+@j+1, 2) AS int)+1

    IF @Dist > @Dist_temp SET @Dist = @Dist_temp
    BEGIN
        SELECT @Distv0 = @Distv0 + CAST(@Dist AS binary(2)), @j = @j + 1
    END
END

SELECT @Distv1 = @Distv0, @i = @i + 1
END

RETURN @Dist
END

Where should I go from here?

+4  A: 

The way I've done this in the past is to store the "database" (actually a dictionary of words for a spelling correcter) as a trie.

Then I used a branch-and-bound routine to look up nearest matching entries. For small distances, the time it takes is exponential in the distance. For large distances, it is linear in the size of the dictionary, just as you are seeing now.

Branch-and-bound is basically a depth-first tree walk of the trie, but with an error budget. At each node, you keep track of the current levenshtein distance, and if it exceeds the budget, you prune that branch of the tree.

First you do the walk with a budget of zero. That will only find exact matches. If you don't find a match, then you walk it with a budget of one. That will find matches at a distance of 1. If you don't find any, then you do it with a budget of 2, and so on. This sounds inefficient, but since each walk takes so much more time than the previous one, the time is dominated by the last walk that you make.

Added: outline of code (pardon my C):

// dumb version of trie node, indexed by letter. You can improve.
typedef struct tnodeTag {
  tnodeTag* p[128];
} tnode;

tnode* top; // the top of the trie

void walk(tnode* p, char* s, int budget){
  int i;
  if (*s == 0){
    if (p == NULL){
      // print the current trie path
    }
  }
  else if (budget >= 0){
    // try deleting this letter
    walk(p, s+1, budget-1);
    // try swapping two adjacent letters
    if (s[1]){
      swap(s[0], s[1]);
      walk(p, s, budget-1);
      swap(s[0], s[1]);
    }
    if (p){
      for (i = 0; i < 128; i++){
        // try exact match
        if (i == *s) walk(p->p[i], s+1, budget);
        // try replacing this character
        if (i != *s) walk(p->p[i], s+1, budget-1);
        // try inserting this letter
        walk(p->p[i], s, budget-1);
      }
    }
  }
}

Basically, you simulate deleting a letter by skipping it and searching at the same node. You simulate inserting a letter by descending the trie without advancing s. You simulate replacing a letter by acting as if the letter matched, even though it doesn't. When you get the hang of it, you can add other possible mismatches, like replacing 0 with O and 1 with L or I - dumb stuff like that.

You probably want to add a character array argument to represent the current word you are finding in the trie.

Mike Dunlavey
An outline would be really helpful. I understand walking with error budgets, but I don't really know how to do the depth-first tree walk...
Matt
@Matt: a depth first tree walk? You can just use a recursive dfs function, or use a stack. Look up dfs.
Cam
Great! I've been working through the code trying to translate it to SQL and it's working alright so far. I'm not quite sure how to turn the entire table into a Trie though and how to traverse it... It's not quite like C where we have pointers or anything. Anyone have any ideas? I'll probably post this as another question. Thanks again for the help!
Matt
@Matt: I don't think you can actually do this in SQL. I think you gotta grab all the data and build a trie in some other language.
Mike Dunlavey
Please check this question: http://stackoverflow.com/questions/2926790/how-can-i-store-data-in-a-table-as-a-trie-sql-server for details on how to do this in SQL.
Matt
Struggling a bit with a few parts of your code. Think you could rename some of the variables (p and s) or maybe add a bit more commenting in each section? It would really help out with finishing this up.
Matt
@Matt: `p` is just a pointer to the node, and it contains an array `p[]` indexed by a character, pointing to sub-nodes. `s` is just a pointer to character containing a null-terminated string.
Mike Dunlavey
@Mike: Yes but what is `s` for? You mention that I'll probably want to add a character array argument to represent what I'm searching for in the trie, but I thought that's what `s` was for. And why do you use 128 for the length your `p[]` array, would that be because of 128 ascii characters?
Matt
@Matt: `s` is the original array of characters of the string you are searching for. As the program is right now, there is nothing you can print to represent the word you have found. I assumed you could add that. And yes I assumed 128 ascii characters. That's how tries work.
Mike Dunlavey