views:

53

answers:

2

I'm playing around with Levenshteins Edit Distance algorithm, and I want to extend this to count transpositions -- that is, exchanges of adjacent letters -- as 1 edit. The unmodified algorithm counts insertions, deletes or substitutions needed to reach a certain string from another. For instance, the edit distance from "KITTEN" to "SITTING" is 3. Here's the explanation from Wikipedia:

  1. kitten → sitten (substitution of 'k' with 's')
  2. sitten → sittin (substitution of 'e' with 'i')
  3. sittin → sitting (insert 'g' at the end).

Following the same method, the edit distance from "CHIAR" to "CHAIR" is 2:

  1. CHIAR → CHAR (delete 'I')
  2. CHAR → CHAIR (insert 'I')

I would like to count this as "1 edit", since I only exchange two adjacent letters. How would I go about to do this?

+4  A: 

You need one more case in the algorithm from Wikipedia:

if s[i] = t[j] then 
  d[i, j] := d[i-1, j-1]
else if i > 0 and j > 0 and s[i] = t[j - 1] and s[i - 1] = t[j] then
  d[i, j] := minimum
             (
               d[i-2, j-2] + 1 // transpose
               d[i-1, j] + 1,  // deletion
               d[i, j-1] + 1,  // insertion
               d[i-1, j-1] + 1 // substitution
             )
else
  d[i, j] := minimum
             (
               d[i-1, j] + 1,  // deletion
               d[i, j-1] + 1,  // insertion
               d[i-1, j-1] + 1 // substitution
             )
Mark Byers
@Mark Byers: woaw now I want to go back on my old SVN backups and find back my LED-modified algo and add this :) Is it really working? :)
Webinator
Fabolous! What can I say - it works like a charm :-) Thank you so much! My first approach was to make a seperate pass over the two strings and search for and fix adjacent exchanges, but the code became very ugly very quickly! Your solution is unbeliavably clean compared to mine - and in addition your solution works :-)
Svein Bringsli
Oops didn't update in the middle of typesetting my response. Minor comment, one could step back by increments of two if the last two chars are the same. +1
srean
+1  A: 

You have to modify how you update the dynamic programming table. In the original algorithm one considers the tails(or heads) of the two words that differ at the most by length one. The update is the minimum of all such possibilities.

If you want to modify the algorithm such that changes in two adjacent locations count as one, the minimum above has to be computed over tails(or heads) that differ by at most two. You can extend this to larger neighborhoods but the complexity will increase exponentially in the size of that neighborhood.

You can generalize further and assign costs that depend on the character(s) deleted, inserted or substituted, but you have to make sure that the cost you assign to a pair-edit is lower than two single edits, otherwise the two single edits will always win.

Let the words be w1 and w2

dist(i,j) = min(
                dist(i-2,j-2) && w1(i-1,i) == w2(j-1,j) else
                dist(i-1,j-1) && w1(i) == w2(j) else
                dist(i,j-1)   + cost(w2(j)),
                dist(i-1,j)   + cost(w1(i)),
                dist(i-1,j-1) + cost(w1(i), w2(j)),
                dist(i, j-2)  + cost(w2(j-1,j)),
                dist(i-2, j)  + cost(w1(i-1,i)),
                dist(i-2,j-2) + cost(w1(i-1,i), w2(j-1,j))
                ) 

What I mean by the && is that those lines should be considered only if the conditions are satisfied.

srean
+1, you have the right idea, but I was confused by "tails (or heads)", and the top 2 cases in your code snippet don't actually mention costs which is also slightly confusing.
j_random_hacker
@j_random_hacker Thanks for the upvote, its much appreciated. Yeah the explanation is contorted :(. I should have explained the dynamic programming using forward iteration or the backward iteration, not both. Just to clarify though, the cost is 0 for the top two cases because of exact match.
srean