ansaurus

Question

Is there an edit distance algorithm that takes "chunk transposition" into account?

Answer 1

A:

I'm not sure whether what you really want is edit distance, which works simply on strings of characters, or semantic distance, which chooses the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to pick the most appropriate matching term or phrase for a given term or phrase. In a sense, what you're doing is comparing very short documents rather than strings of characters.

tvanfosson 2009-05-18 15:11:28

The trouble with that is that I want to use this as an automated classifier, not as an interactive query suggestion device. Also, my main use case (identical words, different word order and punctuation) is a simple edit, e.g. a single call to transpose-words in Emacs. :)

Steven Huwig 2009-05-18 17:27:53

Answer 2

+1 A:

I think you're looking for Jaro-Winkler distance which is precisely for name matching.

bubaker 2009-05-18 15:26:23

That seems to do character transposition but not character sequence transposition. In my use case, it's much more likely that the name will be spelled correctly than it will be in a consistent word order.

Steven Huwig 2009-05-18 17:24:05

Although it does allow for multiple transpositions, you're right that it doesn't explicitly account for ones in sequence. Maybe you could try the suggestion to convert a sequence of words to characters from this related SO question: http://stackoverflow.com/questions/828132/levenshtein-distance-how-to-better-handle-words-swapping-positions

bubaker 2009-05-18 19:39:19

Answer 3

+1 A:

You might find compression distance useful for this. See an answer I gave for a very similar question.

Or you could use a k-tuple based counting system:

Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(kn log n) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.

With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
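
The sorted-list steps above can be sketched roughly as follows. This is only an illustration, not a tuned implementation: the function names `ktuples` and `shared_ktuples` are made up for this example, and repeated k-tuples are counted once per matching pair (an assumption, since the steps don't say how duplicates should be handled).

```python
def ktuples(s, k=4):
    """Return every length-k substring of s (empty list if s is shorter than k)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def shared_ktuples(a, b, k=4):
    """Count k-tuples common to a and b by merging the two sorted lists."""
    xs, ys = sorted(ktuples(a, k)), sorted(ktuples(b, k))
    i = j = shared = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            # Same k-tuple in both lists: count it and advance both sides.
            shared += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return shared


print(shared_ktuples("Alan Turing", "Turing Alan"))  # 4: "Alan", "Turi", "urin", "ring"
```

The raw count depends on string length, so for classification you would probably want to normalize it, e.g. by the number of k-tuples in the shorter string.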

j_random_hacker 2009-05-19 15:57:38

Answer 4

+2 A:

In the case of your application you should probably think about adapting some algorithms from bioinformatics.

For example, you could first normalize your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
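
A minimal sketch of that idea, under a few assumptions: the helper names `normalize` and `matching_pieces` are invented here, every run of non-alphanumeric characters is treated as a separator, matching is case-sensitive, and Python's built-in substring test stands in for a hand-written Horspool search.

```python
import re


def normalize(s):
    """Replace every run of punctuation/whitespace with a single space."""
    return re.sub(r"[\W_]+", " ", s).strip()


def matching_pieces(a, b):
    """Count the space-separated pieces of a that occur verbatim in b."""
    a, b = normalize(a), normalize(b)
    # `piece in b` is an exact substring search; any exact matcher would do.
    return sum(1 for piece in a.split() if piece in b)


print(matching_pieces("Turing, Alan", "Alan Turing"))  # 2
```

Note that a substring test also matches inside longer words ("Al" would match in "Alan"), so you may prefer to compare against the set of pieces of the other string instead.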

If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
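
To give a feeling for what a local alignment score looks like, here is a small Smith-Waterman sketch that computes only the best score, not the alignment itself. The scoring values (match=2, mismatch=-1, gap=-1) and the function name are arbitrary choices for this example, and the gap penalty is linear for simplicity.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(local_alignment_score("Alan Turing", "Turing Alan"))  # 12, from the shared "Turing"
```

It runs in O(nm) time, which is fine for names but is part of why it can be overkill. A single local alignment also only rewards the longest shared chunk, so on its own it does not fully capture a transposition.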

Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.

Well, that was a rather abstract answer, and sadly it doesn't give you a simple formula to solve your problem, but I hope it points you in the right direction.

Paul 2009-05-19 16:21:01

Answer 5

+1 A:

Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words, the intersection over the union). Strictly speaking, that ratio is the Jaccard similarity, and the distance is one minus it. For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF", the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
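
The calculation above is short enough to sketch directly. The function name is made up for this example, and returning 1.0 for two empty inputs is an assumption (the ratio is undefined there).

```python
def jaccard_similarity(a, b):
    """Intersection over union of the unique tokens in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: treat two empty inputs as identical
    return len(sa & sb) / len(sa | sb)


# Characters as tokens, as in the example above.
print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF"))  # 0.875

# Whole words as tokens: word order no longer matters at all.
print(jaccard_similarity("Alan Turing".split(), "Turing Alan".split()))  # 1.0
```

Passing any iterable of tokens (characters, words, or n-grams) works, since only the set of distinct tokens is compared.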

2009-08-19 22:57:15

Thanks! I think this might actually work best.

Steven Huwig 2009-09-03 11:23:21
