ansaurus

Question

Finding closest neighbour using optimized Levenshtein Algorithm

Answer 1

A:

I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.

char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}

Added to explain trie-search:

// definition of trie-node:
struct TNode {
  TNode* pa[128]; // for each possible character, pointer to subnode
};

// simple trie-walk of a node
// key is the input word, answer is the output word,
// i is the character position, and hdis is the hamming distance.
void walk(TNode* p, char key[], char answer[], int i, int hdis){
  // If this is the end of a word in the trie, it is marked as
  // having something non-null under the '\0' entry of the trie.
  if (p->pa[0] != null){
    if (key[i] == '\0') printf("answer = %s, hdis = %d\n", answer, hdis);
  }
  // for every actual subnode of the trie
  for(char c = 1; c < 128; c++){
    // if it is a real subnode
    if (p->pa[c] != null){
      // keep track of the answer word represented by the trie
      answer[i] = c; answer[i+1] = '\0';
      // and walk that subnode
      // If the answer disagrees with the key, increment the hamming distance
      walk(p->pa[c], key, answer, i+1, (answer[i]==key[i] ? hdis : hdis+1));
    }
  }
}
// Note: you have to edit this to handle short keys.
// Simplest is to just append a lot of '\0' bytes to the key.

Now, to limit it to a budget, just refuse to descend if hdis is too large.

Mike Dunlavey 2010-07-07 15:47:54

without DP or memoization, the running time of this algorithm doesn't appear to be O(kn)

Ron 2010-07-07 15:59:15

In the average case, this should still be faster than O(mn) because of the bounds on k. However, in the worst case this will be slower than O(mn).

efficiencyIsBliss 2010-07-07 16:18:10

@Dharmesh: I confess I've used it more in the context of spelling-correction, where it is conjoined with a trie-walk of a dictionary to find nearest-matching words.

Mike Dunlavey 2010-07-07 16:23:16

@Mike: I also have a trie walkthrough to find if an exact match exists. I'm using this in conjunction with another algorithm to find the nearest match if an exact match does not exist. After profiling my code, I found that the vast majority of the time was being spent calculating the Levenshtein distance, which is why I'm trying to optimize that bit right now.

efficiencyIsBliss 2010-07-07 17:19:49

@Dharmesh: Then my approach might help you, because I combine the distance calculation into the trie-walk. It's not a separate computation. I've posted this code elsewhere on SO: http://stackoverflow.com/questions/2918771/optimizing-levenshtein-distance-algorithm/2920693#2920693(It's been about 30 years, so it may be a bit rusty.)

Mike Dunlavey 2010-07-07 19:04:47

@Mike: I read your answer for the other question. The thing is, if the word starts with a letter that's incorrect then a trie is useless. I haven't had time to go through your code yet, so you might have accounted for this somehow. Thanks anyway!

efficiencyIsBliss 2010-07-07 22:32:19

@Dharmesh: No. Here's a simple way to think about it. You just walk the entire trie. Each time you descend to a subnode, you're committing to a letter in the dictionary. As you do that, keep track of the cost. If the letter is the same as the corresponding letter in the key, then the cost doesn't increase, otherwise it goes up by 1. So, when you get to each leaf, the cost = hamming distance. Then use the budget to prune the search, so you don't reach any leaves with hamming distance greater than the budget. Then generalize to more interesting changes.

Mike Dunlavey 2010-07-08 00:02:48

ansaurus

tags:

views:

answers:

Finding closest neighbour using optimized Levenshtein Algorithm

related questions