views:

144

answers:

2

I'm interested in a function of two word lists, which would return an order agnostic edit distance between them.

That is, the arguments would be two lists of (let's say space delimited) words and return value would be the minimum sum of the edit (or Levenshtein) distances of the words in the lists.

Distance between "cat rat bat" and "rat bat cat" would be 0. Distance between "cat rat bat" and "fat had bad" would be the same as distance between "rat bat cat" and "had fat bad", 4. In the case the number of words in the lists are not the same, the shorter list would be padded with 0-length words.

My intuition (which hasn't been nurtured with computer science classes) does not find any other solution than to use brute force:

   |had|fat|bad|   a solution
---+---+---+---+ +---+---+---+
cat| 2 | 1 | 2 | |   | 1 |   |
---+---+---+---+ +---+---+---+
rat| 2 | 1 | 2 | | 3 |   |   |
---+---+---+---+ +---+---+---+
bat| 2 | 1 | 1 | |   |   | 4 |
---+---+---+---+ +---+---+---+

Starting from the first row, pick a column and go to the next rows without ever revisiting a column you have already visited. Do this over and over again until you've tried all combinations.

To me this sounds a bit like the traveling salesman problem. Is it, and how would you solve my particular problem?

+8  A: 

As you already showed in the grid on the left, you can start by calculating the edit distances for every pair of words. This is easily done in polynomial time (n^2 edit distance calculations).

Then your problem can be described as a "minimum weighted bipartite matching", or equivalently, a "maximum weighted bipartite matching". This can also be done in polynomial time (faster than traveling salesman). See http://en.wikipedia.org/wiki/Matching_%28graph_theory%29#Maximum_matchings_in_bipartite_graphs

mathmike
+1 I've no idea what are you talking about, but you say it in a convincing way, hence I bet you are right.
Lo'oris
+1  A: 

This looks like a perfect oportunity to break out the Levenshtein distance algorithm. This algorithm will do exactly what you are looking for (the minimal distance between two strings) and it is pretty efficient too. As far as it being a variation of traveling salesman that would be a no because this can be solved in polynomial time due to the structure of the problem. Hope this helps.

Jeremy E
I'm using the Levenshtein distance to calculate distances between words. My problem is to find the word pairs which form the minimum sum of the Levenshtein distances.
Ville Koskinen