ansaurus

Question

Levenshtein distance: how to better handle words swapping positions?

Answer 1

+6 A:

It's easy. Just use the Damerau-Levenshtein distance on the words instead of letters.

Unknown 2009-05-06 05:24:34

Do you mean: "for every word in A, find the levenshtein distance to every word in B, then add up your results"?

thomasrutter 2009-05-06 05:29:18

No, I mean turn every word into a symbol, i.e. The = a, quick = b, brown = c, etc., and then run the Levenshtein algorithm on that.

Unknown 2009-05-06 05:34:21
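
The word-to-symbol idea can be sketched in PHP roughly as follows. This is only an illustration: the helper name words_to_symbols, the lowercasing, and the one-byte-per-word encoding are assumptions, not something given in the thread, and the encoding only works while there are fewer than 255 distinct words.

```php
<?php
// Hypothetical helper (not from the thread): encode each distinct word as a
// single byte so the built-in levenshtein() compares words, not letters.
// Assumes fewer than 255 distinct words across all compared strings.
function words_to_symbols(string $text, array &$alphabet): string
{
    $encoded = '';
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        if (!isset($alphabet[$word])) {
            $alphabet[$word] = chr(count($alphabet) + 1); // next unused byte
        }
        $encoded .= $alphabet[$word];
    }
    return $encoded;
}

// The same $alphabet must be shared by every string being compared.
$alphabet = [];
$a = words_to_symbols("The quick brown fox", $alphabet);
$b = words_to_symbols("The quiet brown fox", $alphabet);

$distance = levenshtein($a, $b); // 1: "quick" was replaced by "quiet"
```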

No, I see what you mean: implement a version of the Levenshtein algorithm that compares words rather than letters. Unfortunately this still would not work for me, as two words that swap positions with each other would still count the same as deleting a word and creating an entirely different word.

thomasrutter 2009-05-06 05:34:40

I.e. levenshtein("abcd", "cbad") still gives a distance no lower than levenshtein("abcd", "abxy").

thomasrutter 2009-05-06 05:36:10
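
The point can be checked directly with PHP's built-in levenshtein(), here on letters; the variable names are just for illustration.

```php
<?php
// With plain Levenshtein, a swap of two symbols costs the same as
// two unrelated substitutions.
$swapped   = levenshtein("abcd", "cbad"); // 2: "a" and "c" trade places
$unrelated = levenshtein("abcd", "abxy"); // 2: "c" and "d" are replaced
```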

Then you might look at similar algorithms like http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

Unknown 2009-05-06 05:39:05

I like the sound of this Damerau-Levenshtein distance (with transpositions). Only thing I'm worried about now is how much slower it's going to be implementing it in the PHP code. Thanks for the tip!

thomasrutter 2009-05-06 05:52:52

Answer 2

+2 A:

You can also try this (just an extra suggestion):

$one = metaphone("The quick brown fox"); // 0KKBRNFKS
$two = metaphone("brown quick The fox"); // BRNKK0FKS
$three = metaphone("The quiet swine flu"); // 0KTSWNFL

similar_text($one, $two, $percent1); // 66.666666666667
similar_text($one, $three, $percent2); // 47.058823529412
similar_text($two, $three, $percent3); // 23.529411764706

This shows that the first and second strings are more similar to each other than the first and third, or the second and third.

Ólafur Waage 2009-05-06 09:34:45

I think this improvement in score comes more from the use of similar_text than from metaphone. I'm currently using a phonetic algorithm very similar to metaphone. I haven't looked much into the algorithm similar_text uses. I was under the impression it was a lot less efficient than levenshtein, but I guess you get what you pay for. I might try it.

thomasrutter 2009-05-06 12:21:01

I tried with only similar_text (no metaphone) and it gave much lower scores, including a lower score between one and two than between one and three.

Ólafur Waage 2009-05-06 13:02:15

Answer 3

A:

Explode on spaces, sort the array, implode, then do the Levenshtein.

rooskie 2009-05-06 17:06:14
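
A minimal sketch of the explode, sort, implode approach might look like this. The function name sorted_words_distance and the lowercasing step are assumptions added for the example.

```php
<?php
// Hypothetical helper: sort the words of each string before comparing,
// so that word order no longer affects the distance.
function sorted_words_distance(string $a, string $b): int
{
    $normalize = function (string $text): string {
        $words = explode(' ', strtolower($text)); // lowercasing is an assumption
        sort($words, SORT_STRING);
        return implode(' ', $words);
    };
    return levenshtein($normalize($a), $normalize($b));
}

$same  = sorted_words_distance("The quick brown fox", "brown quick The fox"); // 0
$other = sorted_words_distance("The quick brown fox", "The quiet brown fox"); // 2
```

Note that this throws away word order entirely, so any reordering scores as identical, which may or may not be what you want.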

Answer 4

+1 A:

Take this answer and make the following change:

void match(trie t, char* w, string s, int budget){
  if (budget < 0) return;
  if (*w=='\0') print s;
  foreach (char c, subtrie t1 in t){
    /* try matching or replacing c */
    match(t1, w+1, s+c, (*w==c ? budget : budget-1));
    /* try deleting c */
    match(t1, w, s, budget-1);
  }
  /* try inserting *w */
  match(t, w+1, s + *w, budget-1);
  /* TRY SWAPPING FIRST TWO CHARACTERS */
  if (w[1]){
    swap(w[0], w[1]);
    match(t, w, s, budget-1);
    swap(w[0], w[1]);
  }
}

This is for dictionary search in a trie, but for matching to a single word, it's the same idea. You're doing branch-and-bound, and at any point, you can make any change you like, as long as you give it a cost.

Mike Dunlavey 2009-05-06 17:24:49

This looks like it could be quite useful, though it will take a bit of research on my part to figure out how it works. I haven't used a Trie before, so I'll investigate.

thomasrutter 2009-05-07 13:13:48

@thomas. You only need the trie if you're searching a dictionary. If you're just comparing two strings (or lists of things), the "foreach" just becomes a simple statement block. Recursive branch-and-bound is a pretty useful Swiss Army knife.

Mike Dunlavey 2009-05-07 14:25:13
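
For comparing just two word lists, the same recursive branch-and-bound idea could be sketched in PHP as below. This is not Mike Dunlavey's code: the function name within_budget, its parameters, and the choice to return a boolean are all assumptions for illustration. It only allows swaps of adjacent words.

```php
<?php
// Hypothetical sketch: can $a be turned into $b with at most $budget edits,
// where an edit is a replace, delete, insert, or swap of adjacent words?
function within_budget(array $a, array $b, int $budget, int $i = 0, int $j = 0): bool
{
    if ($budget < 0) {
        return false; // bound: this branch is already too expensive
    }
    if ($i === count($a) || $j === count($b)) {
        // One list is exhausted: the leftovers must all be inserted or deleted.
        return (count($a) - $i) + (count($b) - $j) <= $budget;
    }
    // Try matching (free) or replacing (cost 1) the current word.
    if (within_budget($a, $b, $budget - ($a[$i] === $b[$j] ? 0 : 1), $i + 1, $j + 1)) {
        return true;
    }
    // Try deleting a word from $a, then inserting a word from $b.
    if (within_budget($a, $b, $budget - 1, $i + 1, $j)
        || within_budget($a, $b, $budget - 1, $i, $j + 1)) {
        return true;
    }
    // Try swapping two adjacent words.
    return $i + 1 < count($a) && $j + 1 < count($b)
        && $a[$i] === $b[$j + 1] && $a[$i + 1] === $b[$j]
        && within_budget($a, $b, $budget - 1, $i + 2, $j + 2);
}

$a = explode(' ', "the quick brown fox");
$b = explode(' ', "the brown quick fox");
$withinOne  = within_budget($a, $b, 1); // true: one adjacent swap is enough
$withinZero = within_budget($a, $b, 0); // false: the lists are not identical
```

The search is exponential in the budget, so it is only practical when the allowed number of edits is small.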

Answer 5

A:

Eliminate duplicate words between the two strings and then use Levenshtein.

JRL 2009-05-06 17:41:49
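
One possible reading of this suggestion, sketched in PHP: drop every word that occurs in both strings, then run levenshtein() on what remains. The function name distance_without_shared_words and the lowercasing are assumptions, and the answer may have intended something different.

```php
<?php
// Hypothetical reading of the suggestion: remove every word that occurs in
// both strings, then compare what is left with levenshtein().
function distance_without_shared_words(string $a, string $b): int
{
    $wordsA = explode(' ', strtolower($a));
    $wordsB = explode(' ', strtolower($b));
    $onlyA = array_diff($wordsA, $wordsB); // words that appear only in $a
    $onlyB = array_diff($wordsB, $wordsA); // words that appear only in $b
    return levenshtein(implode(' ', $onlyA), implode(' ', $onlyB));
}

$reordered = distance_without_shared_words("The quick brown fox", "brown quick The fox"); // 0
$changed   = distance_without_shared_words("The quick brown fox", "The quiet brown fox"); // 2
```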

Answer 6

A:

I've been implementing levenshtein in a spell checker.

What you're asking for is counting transpositions as 1 edit.

This is easy if you only wish to count transpositions of words one position apart. However, for transpositions of words 2 or more positions apart, the addition to the algorithm is, in the worst case, !(max(wordorder1.length(), wordorder2.length())). Adding a non-linear subalgorithm to an already quadratic algorithm is not a good idea.

This is how it would work.

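// x indexes wordorder1, y indexes wordorder2 (both 1-based in workarray);
// cost is 0 if wordorder1[x-1] == wordorder2[y-1], otherwise 1
best = min(workarray[x-1, y] + 1,        // deletion
           workarray[x, y-1] + 1,        // insertion
           workarray[x-1, y-1] + cost);  // match or substitution
if (x > 1 && y > 1
    && wordorder1[x-1] == wordorder2[y-2]
    && wordorder1[x-2] == wordorder2[y-1])
{
  best = min(best, workarray[x-2, y-2] + 1);  // touching transposition
}
workarray[x, y] = best;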

That is JUST for touching transpositions. If you want all transpositions, you'd have to work backwards from every position, comparing

1[n] == 2[n-2].... 1[n] == 2[0]....

So you see why they don't include this in the standard method.

2009-06-05 19:44:12
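
As a rough PHP sketch of the touching-transpositions case on arrays of words, here is the restricted form of Damerau-Levenshtein usually called "optimal string alignment". The function name word_osa_distance is an assumption, and this is an illustration rather than the poster's spell-checker code.

```php
<?php
// Sketch of the "optimal string alignment" form of Damerau-Levenshtein,
// applied to arrays of words. Only swaps of adjacent words count as one edit.
function word_osa_distance(array $a, array $b): int
{
    $n = count($a);
    $m = count($b);
    $d = [];
    for ($x = 0; $x <= $n; $x++) { $d[$x] = [0 => $x]; } // delete all of $a
    for ($y = 0; $y <= $m; $y++) { $d[0][$y] = $y; }     // insert all of $b

    for ($x = 1; $x <= $n; $x++) {
        for ($y = 1; $y <= $m; $y++) {
            $cost = ($a[$x - 1] === $b[$y - 1]) ? 0 : 1;
            $d[$x][$y] = min(
                $d[$x - 1][$y] + 1,         // deletion
                $d[$x][$y - 1] + 1,         // insertion
                $d[$x - 1][$y - 1] + $cost  // match or substitution
            );
            if ($x > 1 && $y > 1
                && $a[$x - 1] === $b[$y - 2]
                && $a[$x - 2] === $b[$y - 1]) {
                // adjacent transposition
                $d[$x][$y] = min($d[$x][$y], $d[$x - 2][$y - 2] + 1);
            }
        }
    }
    return $d[$n][$m];
}

$adjacent = word_osa_distance(
    explode(' ', "the quick brown fox"),
    explode(' ', "the brown quick fox")
); // 1: "quick" and "brown" are neighbours
$distant = word_osa_distance(
    explode(' ', "the quick brown fox"),
    explode(' ', "brown quick the fox")
); // 2: "the" and "brown" are not adjacent, so it costs two substitutions
```

The table is still quadratic in the number of words, the same order as plain Levenshtein, since the extra check is constant work per cell.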

Answer 7

A:

i believe this is a prime example for using a vector-space search engine.

in this technique, each document essentially becomes a vector with as many dimensions as there are different words in the entire corpus; similar documents then occupy neighboring areas in that vector space. one nice property of this model is that queries are also just documents: to answer a query, you simply calculate its position in vector space, and your results are the closest documents you can find. i am sure there are get-and-go solutions for PHP out there.

to fuzzify results from vector space, you could consider stemming or a similar natural language processing technique, and use levenshtein to construct secondary queries for similar words that occur in your overall vocabulary.

flow 2010-08-07 20:57:56
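
A very small illustration of the vector-space idea in PHP: treat each string as a bag of word counts and compare the two vectors by cosine similarity. The function name cosine_similarity and the plain word-count weighting are assumptions; a real search engine would typically add term weighting and stemming.

```php
<?php
// Hypothetical helper: cosine similarity between bag-of-words vectors.
// Each distinct word is one dimension; the value is how often it occurs.
function cosine_similarity(string $a, string $b): float
{
    $countsA = array_count_values(explode(' ', strtolower($a)));
    $countsB = array_count_values(explode(' ', strtolower($b)));

    $dot = 0;
    foreach ($countsA as $word => $count) {
        $dot += $count * ($countsB[$word] ?? 0);
    }
    $square = function ($c) { return $c * $c; };
    $normA = sqrt(array_sum(array_map($square, $countsA)));
    $normB = sqrt(array_sum(array_map($square, $countsB)));

    return ($normA > 0 && $normB > 0) ? $dot / ($normA * $normB) : 0.0;
}

// Word order does not matter: the same words give the same vector.
$reordered = cosine_similarity("The quick brown fox", "brown quick The fox");
// Only "the" is shared here, so the similarity is much lower.
$different = cosine_similarity("The quick brown fox", "The quiet swine flu");
```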
