I just implemented a best-match file search that finds the closest match to a query string in a dictionary. After profiling my code, I found that the overwhelming majority of the time is spent calculating the distance between the query and the candidate results. I currently calculate the Levenshtein distance with a full 2-D array, which makes it an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same thing.

Here's my implementation:

public int calculate(String root, String query)
{
    // Row 0 / column 0 hold the character codes and row 1 / column 1 hold the
    // base-case distances, so the DP table proper starts at [2][2].
    // arr[1][1] stays 0: the distance between two empty prefixes.
    int[][] arr = new int[root.length() + 2][query.length() + 2];

    for (int i = 2; i < root.length() + 2; i++)
    {
        arr[i][0] = root.charAt(i - 2);   // cached character of root
        arr[i][1] = i - 1;                // distance of this prefix to the empty string
    }

    for (int j = 2; j < query.length() + 2; j++)
    {
        arr[0][j] = query.charAt(j - 2);  // cached character of query
        arr[1][j] = j - 1;                // distance of this prefix to the empty string
    }

    for (int i = 2; i < root.length() + 2; i++)
        for (int j = 2; j < query.length() + 2; j++)
        {
            int diff = (arr[0][j] != arr[i][0]) ? 1 : 0;    // substitution cost
            arr[i][j] = min(arr[i - 1][j] + 1,              // deletion
                            arr[i][j - 1] + 1,              // insertion
                            arr[i - 1][j - 1] + diff);      // substitution / match
        }

    return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
    return Math.min(n1, Math.min(n2, n3));
}
+4  A: 

The Wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation. The most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n*k) instead of O(n^2), basically by giving up as soon as the minimum possible distance in the current row becomes > k.

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst case behavior (as the matches might be in decreasing order of distance, meaning you'll never bail out any sooner) but average case should improve.
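To make that concrete, here is a rough sketch of the idea (the names calculateBounded and closestMatch are made up, and it uses two rows instead of your full matrix): the DP itself is unchanged, but the minimum of each row is tracked and the computation is abandoned as soon as that minimum exceeds the bound, while the caller tightens the bound to the best distance seen so far.

    public int calculateBounded(String root, String query, int maxDist)
    {
        int n = root.length(), m = query.length();
        if (Math.abs(n - m) > maxDist)
            return maxDist + 1;                 // lengths alone rule this match out

        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++)
            prev[j] = j;

        for (int i = 1; i <= n; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= m; j++)
            {
                int diff = (root.charAt(i - 1) != query.charAt(j - 1)) ? 1 : 0;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + diff);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Every alignment passes through this row, so the final distance
            // cannot be smaller than the row minimum: safe to give up here.
            if (rowMin > maxDist)
                return maxDist + 1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m];
    }

    public String closestMatch(String query, Iterable<String> dictionary)
    {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String candidate : dictionary)
        {
            // The distance can never exceed the longer length, so that is a safe first bound.
            int bound = Math.min(bestDist, Math.max(query.length(), candidate.length()));
            int d = calculateBounded(candidate, query, bound);
            if (d < bestDist)
            {
                bestDist = d;
                best = candidate;
            }
        }
        return best;
    }

The worst case is still quadratic per pair, but on a realistic dictionary most candidates should get rejected after only a few rows.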

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).

Alex Martelli
I just found the same and was about to post it here. Thanks!
efficiencyIsBliss
You can always create a small candidate set of matches using whatever "approximate distance" metric you come up with, then use real Levenshtein to rank those.
kibibu
I read up on the optimization that the above post talks of, but I can't honestly say I get it. I'm not sure how to implement it either. Can someone offer some help?
efficiencyIsBliss
@Dharmesh, not in comments -- way too cramped -- but a separate question on _finding closest neighbor_ (the main optimization you can perform being to _not_ fully compute the distance in most cases -- and that optimization is only possible because you want the nearest neighbor, not for all distance computation tasks!) ideally with some samples of typical/interesting roots and sets of queries, should elicit lots of help (be sure to tag it as "algorithm" too!).
Alex Martelli
@Alex: that optimisation is also possible if the OP wanted to find all "likely" matches, "likely" being defined as e.g. `distance(query,candidate)/max(len(query),len(candidate)) < some_threshold_fraction`
John Machin
@Alex: I realized that the trick is to not compute the entire distance. I just don't fully understand what part of the entire distance I should compute and that is what I wanted help with. I'll make the new question with the name you suggested. If you could offer some more help there, I'd really appreciate it.
efficiencyIsBliss
@John: I don't think I need all the likely matches. I just need the distance of the closest match.
efficiencyIsBliss
I posted the new question under the name "Finding closest neighbour using optimized Levenshtein Algorithm". Here's the link: http://stackoverflow.com/questions/3195269/finding-closest-neighbour-using-optimized-levenshtein-algorithm
efficiencyIsBliss
@Dharmesh, I see you're getting excellent answers on the new question so hopefully your doubts about what to compute and how are now being quelled without the need for further input from me.
Alex Martelli
@Alex: Yeah, I got a lot of good help there. Some more wouldn't hurt though, if you have the time.
efficiencyIsBliss
+1  A: 

The Wikipedia article discusses your algorithm and various improvements. However, it appears that, at least in the general case, O(n^2) is the best you can get.

There are, however, some improvements if you can restrict your problem: for example, if you are only interested in the distance when it is smaller than some bound d, the complexity drops to O(dn). That restriction often makes sense, since a match whose distance is close to the string length is probably not very interesting anyway. See if you can exploit the specifics of your problem...
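For illustration, here is a sketch of what that restriction might look like (the name bandedDistance is made up): only the cells within d of the main diagonal are filled in, everything outside the band is treated as infinity, and each row then costs only O(d) work.

    import java.util.Arrays;

    // Sketch: Levenshtein distance restricted to a band of width 2*maxDist+1
    // around the diagonal. Cells outside the band are treated as "infinity"
    // (maxDist + 1), which keeps the work at O(maxDist * n).
    public static int bandedDistance(String a, String b, int maxDist)
    {
        int n = a.length(), m = b.length();
        if (Math.abs(n - m) > maxDist)
            return maxDist + 1;               // can't possibly be within the bound

        final int INF = maxDist + 1;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++)
            prev[j] = (j <= maxDist) ? j : INF;

        for (int i = 1; i <= n; i++)
        {
            Arrays.fill(curr, INF);
            curr[0] = (i <= maxDist) ? i : INF;
            int lo = Math.max(1, i - maxDist);    // first column inside the band
            int hi = Math.min(m, i + maxDist);    // last column inside the band
            for (int j = lo; j <= hi; j++)
            {
                int diff = (a.charAt(i - 1) != b.charAt(j - 1)) ? 1 : 0;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + diff);
                if (curr[j] > INF) curr[j] = INF; // clamp so "infinity" stays bounded
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m];
    }

A result greater than d simply means "no match within d"; the exact value is not meaningful in that case.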

sleske
+2  A: 

According to a comment on this blog, Speeding Up Levenshtein, you can use VP-trees and achieve O(n log n). Another comment on the same blog points to a Python implementation of VP-trees and Levenshtein. Please let us know if this works.
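For reference, here is an illustrative Java sketch of a VP-tree (this is not the implementation the blog comments link to, and the class and method names are made up). It relies on the distance being a true metric, which Levenshtein is, because the pruning is based on the triangle inequality.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.function.BiFunction;

    // Minimal vantage-point tree over strings, parameterised by a distance function.
    class VpTree
    {
        private static class Node
        {
            String vantage;
            int radius;          // median distance from the vantage point to its subtree
            Node inside, outside;
        }

        private final BiFunction<String, String, Integer> distance;
        private final Node root;

        VpTree(List<String> items, BiFunction<String, String, Integer> distance)
        {
            this.distance = distance;
            this.root = build(new ArrayList<>(items));
        }

        private Node build(List<String> items)
        {
            if (items.isEmpty()) return null;
            Node node = new Node();
            node.vantage = items.remove(items.size() - 1);   // a random pick would also do
            if (items.isEmpty()) return node;
            // Partition the remaining items by their distance to the vantage point.
            items.sort(Comparator.comparingInt((String s) -> distance.apply(node.vantage, s)));
            int mid = items.size() / 2;
            node.radius = distance.apply(node.vantage, items.get(mid));
            node.inside = build(new ArrayList<>(items.subList(0, mid)));               // dist <= radius
            node.outside = build(new ArrayList<>(items.subList(mid, items.size())));   // dist >= radius
            return node;
        }

        private String best;
        private int bestDist;

        String nearest(String query)
        {
            best = null;
            bestDist = Integer.MAX_VALUE;
            search(root, query);
            return best;
        }

        private void search(Node node, String query)
        {
            if (node == null) return;
            int d = distance.apply(query, node.vantage);
            if (d < bestDist) { bestDist = d; best = node.vantage; }
            // Triangle inequality: skip a side that cannot hold anything closer than bestDist.
            if (d < node.radius)
            {
                search(node.inside, query);
                if (d + bestDist >= node.radius) search(node.outside, query);
            }
            else
            {
                search(node.outside, query);
                if (d - bestDist <= node.radius) search(node.inside, query);
            }
        }
    }

Assuming the calculate method from the question, you could build the tree once with new VpTree(dictionary, (a, b) -> calculate(a, b)) and then call nearest(query) for each lookup. Building takes roughly O(n log n) distance computations, and a lookup typically only evaluates the distance for a small fraction of the dictionary, though a skewed distance distribution can degrade that.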

Andrew B.