ansaurus

Question

Most efficient way to calculate Levenshtein distance

Answer 1

+4 A:

The wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation -- the most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!) you can reduce the computation to O(n times k) instead of O(n squared) (basically by giving up as soon as the minimum possible distance becomes > k).

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst case behavior (as the matches might be in decreasing order of distance, meaning you'll never bail out any sooner) but average case should improve.

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).

Alex Martelli 2010-07-06 02:46:38

I just found the same and was about to post it here. Thanks!

efficiencyIsBliss 2010-07-06 02:57:37

You can always create with a small set of matches using whatever "approximate distance" metric you come up with, then use real Levenshtein to rank those.

kibibu 2010-07-06 04:06:04

I read up on the optimization that the above post talks of, but I can't honestly say I get it. I'm not sure how to implement it either. Can someone offer some help?

efficiencyIsBliss 2010-07-07 02:29:23

@Dharmesh, not in comments -- way too cramped -- but a separate question on _finding closest neighbor_ (the main optimization you can perform being to _not_ fully compute the distance in most cases -- and that optimization is only possible because you want the nearest neighbor, not for all distance computation tasks!) ideally with some samples of typical/interesting roots and sets of queries, should elicit lots of help (be sure to tag it as "algorithm" too!).

Alex Martelli 2010-07-07 03:35:41

@Alex: that optimisation is also possible if the OP wanted to find all "likely" matches, "likely" being defined as e.g. `distance(query,candidate)/max(len(query),len(candidate)) < some_threshold_fraction`

John Machin 2010-07-07 07:31:34

@Alex: I realized that the trick is to not compute the entire distance. I just don't fully understand what part of the entire distance I should compute and that is what I wanted help with. I'll make the new question with the name you suggested. If you could offer some more help there, I'd really appreciate it.

efficiencyIsBliss 2010-07-07 13:10:18

@John: I don't think I need all the likely matches. I just need the distance of the closest match.

efficiencyIsBliss 2010-07-07 13:11:06

I posted the new question under the name "Finding closest neighbour using optimized Levenshtein Algorithm". Here's the link: http://stackoverflow.com/questions/3195269/finding-closest-neighbour-using-optimized-levenshtein-algorithm

efficiencyIsBliss 2010-07-07 13:42:08

@Dharmesh, I see you're getting excellent answers on the new question so hopefully your doubts about what to compute and how are now being quelled without the need for further input from me.

Alex Martelli 2010-07-07 18:31:13

@Alex: Yeah, I got a lot of good help there. Some more wouldn't hurt though, if you have the time.

efficiencyIsBliss 2010-07-07 19:47:57

Answer 2

+1 A:

The Wikipedia article discusses your algorithm, and various improvements. However, it appears that at least in the general case, O(n^2) is the best you can get.

There are however some improvements if you can restrict your problem (e.g. if you are only interested in the distance if it's smaller than d, complexity is O(dn) - this might make sense as a match whose distance is close to the string length is probably not very interesting ). See if you can exploit the specifics of your problem...

sleske 2010-07-06 02:47:59

Answer 3

+2 A:

According to a comment on this blog, Speeding Up Levenshtein, you can use VP-Trees and achieve O(nlogn). Another comment on the same blog points to a python implementation of VP-Trees and Levenshtein. Please let us know if this works.

Andrew B. 2010-07-06 02:53:12

ansaurus

tags:

views:

answers:

Most efficient way to calculate Levenshtein distance

related questions