I put "chunk transposition" in quotes because I don't know whether there is a technical term for this, or what it would be. Just knowing the term, if one exists, would be very helpful.

The Wikipedia article on edit distance gives some good background on the concept.

By taking "chunk transposition" into account, I mean that

Turing, Alan.

should match

Alan Turing

more closely than it matches

Turing Machine

I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
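To make the problem concrete, here is a minimal sketch of the standard Levenshtein distance (insert, delete, and substitute, each at cost 1). The function name `levenshtein` is just an illustrative choice, not something from a particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions, all cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


query = "Turing, Alan."
print(levenshtein(query, "Alan Turing"))
print(levenshtein(query, "Turing Machine"))
```

With this formula "Turing Machine" comes out closer to "Turing, Alan." than "Alan Turing" does, because a moved word has to be paid for character by character.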

The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people who do will know a bit about this subject).

A: 

I'm not sure whether what you really want is edit distance -- which works simply on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to pick the most appropriate matching term or phrase for a given term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.

tvanfosson
The trouble with that is that I want to use this as an automated classifier, not as an interactive query suggestion device. Also, my main use case (identical words, different word order and punctuation) is a simple edit, e.g. a single call to transpose-words in Emacs. :)
Steven Huwig
+1  A: 

I think you're looking for Jaro-Winkler distance which is precisely for name matching.

bubaker
That seems to do character transposition but not character sequence transposition. In my use case, it's much more likely that the name will be spelled correctly than it will be in a consistent word order.
Steven Huwig
Although it does allow for multiple transpositions, you're right that it doesn't explicitly account for ones in sequence. Maybe you could try the suggestion to convert a sequence of words to characters from this related SO question: http://stackoverflow.com/questions/828132/levenshtein-distance-how-to-better-handle-words-swapping-positions
bubaker
+1  A: 

You might find compression distance useful for this. See an answer I gave for a very similar question.
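As a rough illustration, one common form of compression distance is the normalized compression distance (NCD). The sketch below assumes that form and uses zlib as the compressor; both are my choices for the example, and the names `compressed_size` and `ncd` are made up. On strings as short as a single name the compressor's fixed overhead dominates, so expect noisy values.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # Length in bytes of the zlib-compressed input (level 9 = best compression).
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


print(ncd("Turing, Alan; Church, Alonzo", "Alonzo Church and Alan Turing"))
```

The idea is that if the second string is mostly rearranged pieces of the first, compressing the concatenation costs little more than compressing one of them alone.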

Or you could use a k-tuple based counting system:

  1. Choose a small value of k, e.g. k=4.
  2. Extract all length-k substrings of your string into a list.
  3. Sort the list. (O(kn log n) time.)
  4. Do the same for the other string you're comparing to. You now have two sorted lists.
  5. Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
  6. The number of k-tuples in common is your similarity score.
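The steps above could be sketched roughly like this. The function names are illustrative, and counting repeated k-tuples with multiplicity (rather than once each) is an assumption on my part, since the steps don't say either way.

```python
def ktuples(s: str, k: int) -> list[str]:
    # Steps 2-3: every length-k substring, in sorted order.
    return sorted(s[i:i + k] for i in range(len(s) - k + 1))


def shared_ktuples(a: str, b: str, k: int = 4) -> int:
    """Steps 4-6: count k-tuples common to both strings via a sorted-list merge."""
    xs, ys = ktuples(a, k), ktuples(b, k)
    i = j = shared = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            shared += 1  # one match consumes one entry from each list
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return shared


print(shared_ktuples("Alan Turing", "Turing Alan"))     # 4
print(shared_ktuples("Alan Turing", "Turing Machine"))  # 3
```

The raw count favours longer strings, so for ranking candidates of different lengths you would probably want to normalize it, e.g. by the number of k-tuples in the shorter string.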

With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.

j_random_hacker
+2  A: 

In the case of your application you should probably think about adapting some algorithms from bioinformatics.

For example, you could first normalize your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
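A minimal sketch of that idea follows. It uses Python's built-in substring search (`in`) in place of a hand-written Horspool implementation, and the helper names are made up for the example. Matching is case-sensitive here.

```python
import re


def normalize(s: str) -> str:
    # Replace every run of separators/punctuation with a single space.
    return " ".join(re.findall(r"\w+", s))


def count_matching_pieces(a: str, b: str) -> int:
    """Split a into words and count how many occur verbatim somewhere in b."""
    haystack = normalize(b)
    return sum(1 for piece in normalize(a).split() if piece in haystack)


print(count_matching_pieces("Turing, Alan.", "Alan Turing"))     # 2
print(count_matching_pieces("Turing, Alan.", "Turing Machine"))  # 1
```

Note that a piece also counts if it only appears inside a longer word of the other string; splitting both strings and comparing whole words would avoid that.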

If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
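For a feel of what a local alignment score looks like, here is a bare-bones Smith-Waterman scoring sketch with a linear gap penalty. The scoring values (match +2, mismatch -1, gap -1) are arbitrary assumptions for the example, and it returns only the best score, not the alignment itself.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            score = max(
                0,                                                 # start a new alignment here
                prev[j - 1] + (match if ca == cb else mismatch),   # align ca with cb
                prev[j] + gap,                                     # gap in b
                curr[j - 1] + gap,                                 # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(local_alignment_score("Alan Turing", "Turing, Alan."))
```

A single local alignment only rewards the best shared region, so for reordered names you would likely sum scores over several pieces rather than rely on one call.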

Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.

Well, that was a rather abstract answer. I hope it points you in the right direction, though sadly it doesn't give you a simple formula that solves your problem.

Paul
+1  A: 

Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words, the intersection over the union). Strictly speaking this ratio is the Jaccard similarity, and the distance is one minus it. For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF", the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW; n-grams are often used as well.
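The character-level calculation can be sketched in a few lines; the function name is just illustrative, and returning 1.0 for two empty strings is an assumption I made to avoid dividing by zero.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Intersection over union of the sets of unique characters in a and b."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # assumption: two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)


print(jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF"))  # 0.875
```

To use word tokens or n-grams instead, build the two sets from `a.split()` or from the strings' n-grams rather than from individual characters.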

Thanks! I think this might actually work best.
Steven Huwig