ansaurus

Question

How to modify Levenshtein's Edit Distance to count "adjacent letter exchanges" as 1 edit

Answer 1

+4 A:

You need one more case in the algorithm from Wikipedia:

if s[i] = t[j] then 
  d[i, j] := d[i-1, j-1]
else if i > 1 and j > 1 and s[i] = t[j - 1] and s[i - 1] = t[j] then
  d[i, j] := minimum
             (
               d[i-2, j-2] + 1, // transpose
               d[i-1, j] + 1,   // deletion
               d[i, j-1] + 1,   // insertion
               d[i-1, j-1] + 1  // substitution
             )
else
  d[i, j] := minimum
             (
               d[i-1, j] + 1,  // deletion
               d[i, j-1] + 1,  // insertion
               d[i-1, j-1] + 1 // substitution
             )
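
For readers who want something runnable, here is a minimal Python sketch of the recurrence above. It is not part of the original answer: the function name osa_distance is made up for illustration, and the strings are 0-indexed here (so s[i-1] is the i-th character), unlike the 1-indexed pseudocode. This variant is usually called the "optimal string alignment" distance.

```python
def osa_distance(s: str, t: str) -> int:
    """Levenshtein distance where swapping two adjacent letters costs 1.

    Hypothetical helper name; a sketch of the pseudocode above.
    """
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and first j chars of t
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]
                continue
            best = min(
                d[i - 1][j] + 1,      # deletion
                d[i][j - 1] + 1,      # insertion
                d[i - 1][j - 1] + 1,  # substitution
            )
            # Extra case: the last two characters are swapped.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                best = min(best, d[i - 2][j - 2] + 1)  # transpose
            d[i][j] = best
    return d[m][n]


print(osa_distance("abcd", "acbd"))
```

With plain Levenshtein the pair "abcd" / "acbd" would cost 2 (two substitutions); with the extra case it costs 1.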

Mark Byers 2010-10-29 20:10:42

@Mark Byers: wow, now I want to go back through my old SVN backups, find my modified Levenshtein edit distance algorithm, and add this :) Does it really work? :)

Webinator 2010-10-29 20:13:32

Fabulous! What can I say - it works like a charm :-) Thank you so much! My first approach was to make a separate pass over the two strings to search for and fix adjacent exchanges, but the code became very ugly very quickly! Your solution is unbelievably clean compared to mine - and in addition, your solution works :-)

Svein Bringsli 2010-10-29 20:28:12

Oops, I didn't see this update while I was in the middle of writing my response. Minor comment: one could step back in increments of two if the last two chars are the same. +1

srean 2010-10-29 20:43:43

Answer 2

+1 A:

You have to modify how you update the dynamic programming table. In the original algorithm, one considers the tails (or heads) of the two words that are shorter than the current ones by at most one character. The update is the minimum over all such possibilities.

If you want to modify the algorithm so that changes in two adjacent locations count as one, the minimum above has to be computed over tails (or heads) that differ by at most two characters. You can extend this to larger neighborhoods, but the complexity will increase exponentially in the size of that neighborhood.

You can generalize further and assign costs that depend on the character(s) deleted, inserted or substituted, but you have to make sure that the cost you assign to a pair-edit is lower than two single edits, otherwise the two single edits will always win.

Let the words be w1 and w2

dist(i,j) = min(
                dist(i-2,j-2)                               if w1(i-1,i) == w2(j-1,j),
                dist(i-1,j-1)                               if w1(i) == w2(j),
                dist(i,j-1)   + cost(w2(j)),
                dist(i-1,j)   + cost(w1(i)),
                dist(i-1,j-1) + cost(w1(i), w2(j)),
                dist(i, j-2)  + cost(w2(j-1,j)),
                dist(i-2, j)  + cost(w1(i-1,i)),
                dist(i-2,j-2) + cost(w1(i-1,i), w2(j-1,j))
                )

Here w1(i-1,i) denotes the two-character block of w1 ending at position i. The first two lines, the ones with an "if", should be considered only if their conditions are satisfied.
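
As a rough illustration, the recurrence above might be sketched in Python as follows. Everything named here is an assumption made for the example, not something from the original answer: block_edit_distance, swap_cost, and the convention that cost(a, b) receives the consumed blocks of w1 and w2 as strings of length 0 to 2 (an empty string meaning insertion or deletion). The exact-match cases are covered by having the cost function return 0 when both blocks are equal.

```python
from typing import Callable

# Block sizes (chars consumed from w1, chars consumed from w2) taken from
# the recurrence: single and pair insert, delete and substitute.
MOVES = [(0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (2, 2)]


def swap_cost(a: str, b: str) -> int:
    """Hypothetical cost function: an adjacent swap costs 1."""
    if a == b:
        return 0                      # exact match
    if len(a) == 2 and len(b) == 2 and a == b[::-1]:
        return 1                      # pair edit cheaper than two single edits
    return max(len(a), len(b))        # otherwise 1 per character touched


def block_edit_distance(w1: str, w2: str,
                        cost: Callable[[str, str], float]) -> float:
    m, n = len(w1), len(w2)
    inf = float("inf")
    dist = [[inf] * (n + 1) for _ in range(m + 1)]
    dist[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            for di, dj in MOVES:
                if di > i or dj > j:
                    continue          # block would reach before the start
                a = w1[i - di:i]      # block of w1 ending at position i
                b = w2[j - dj:j]      # block of w2 ending at position j
                cand = dist[i - di][j - dj] + cost(a, b)
                if cand < dist[i][j]:
                    dist[i][j] = cand
    return dist[m][n]


print(block_edit_distance("abcd", "acbd", swap_cost))
```

If swap_cost returned 2 for a swapped pair, the pair move would never beat two single substitutions, which is the point made above about keeping the pair cost below the cost of two single edits.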

srean 2010-10-29 20:39:34

+1, you have the right idea, but I was confused by "tails (or heads)", and the top 2 cases in your code snippet don't actually mention costs which is also slightly confusing.

j_random_hacker 2010-10-30 07:35:08

@j_random_hacker Thanks for the upvote, it's much appreciated. Yeah, the explanation is contorted :(. I should have explained the dynamic programming using either forward iteration or backward iteration, not both. Just to clarify though, the cost is 0 for the top two cases because they are exact matches.

srean 2010-10-30 07:43:23
