Hello,

I have a map in Java. I would like to compare a source string against every item in the map and return the best match according to a Levenshtein ratio. What would be the optimal way to perform this check on every element in the map?

Thanks, Matt

A: 

Since the Levenshtein ratio depends on both the source and the target, the values change for each source string, so nothing can be precomputed per map entry. Unless there is a high probability that the same source string will be repeated in subsequent searches (in which case caching the result would help), just iterate over the map elements. If speed is truly an issue, make sure you are using an up-to-date Java compiler and runtime, along with any optimization options they offer.
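
A minimal sketch of that linear scan, assuming the candidate strings are the map's keys and taking the ratio to be 1 - distance / max(length). That is one common definition; the question does not say which one it uses. The class and method names (BestMatch, distance, ratio, bestMatch) are made up for illustration:

```java
import java.util.Map;

final class BestMatch {

    /** Two-row dynamic-programming Levenshtein distance. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building the first j chars of b from "" takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // reducing the first i chars of a to "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed ratio: 1 - distance / max(length), so 1.0 means identical strings. */
    static double ratio(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    /** Returns the entry whose key scores highest against source, or null for an empty map. */
    static <V> Map.Entry<String, V> bestMatch(String source, Map<String, V> map) {
        Map.Entry<String, V> best = null;
        double bestRatio = -1.0;
        for (Map.Entry<String, V> entry : map.entrySet()) {
            double r = ratio(source, entry.getKey());
            if (r > bestRatio) {
                bestRatio = r;
                best = entry;
            }
        }
        return best;
    }
}
```

Calling BestMatch.bestMatch(source, map) does one distance computation per entry, so the cost grows linearly with the size of the map. If the strings to compare are the map's values rather than its keys, swap getKey() for getValue().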

David Medinets
A: 

And of course, if you're not already doing so, then use an off-the-shelf optimised Levenshtein implementation, like StringUtils.getLevenshteinDistance in Apache commons-lang.

skaffman
A: 

If iterating over all map elements is too costly, you could consider using k-gram indexes.
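
A rough sketch of what such an index might look like, assuming character k-grams with a '$' padding character at each end. The class name KGramIndex and the minShared threshold are illustrative choices, not taken from any particular library. The idea is to keep, for each k-gram, the set of stored strings that contain it, so that a query only has to be compared against strings that share some k-grams with it:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class KGramIndex {
    private final int k;
    // k-gram -> every indexed term that contains it
    private final Map<String, Set<String>> postings = new HashMap<>();

    KGramIndex(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        this.k = k;
    }

    /** Distinct k-grams of a term, padded with '$' so that word boundaries count too. */
    Set<String> grams(String term) {
        String padded = "$" + term + "$";
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= padded.length(); i++) {
            out.add(padded.substring(i, i + k));
        }
        return out;
    }

    void add(String term) {
        for (String gram : grams(term)) {
            postings.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
        }
    }

    /** Indexed terms that share at least minShared distinct k-grams with the query. */
    List<String> candidates(String query, int minShared) {
        Map<String, Integer> shared = new HashMap<>();
        for (String gram : grams(query)) {
            for (String term : postings.getOrDefault(gram, Collections.<String>emptySet())) {
                shared.merge(term, 1, Integer::sum);
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : shared.entrySet()) {
            if (entry.getValue() >= minShared) {
                out.add(entry.getKey());
            }
        }
        return out;
    }
}
```

You would add every map key once up front, then run the Levenshtein comparison only on the strings returned by candidates(query, minShared). This is a filter, not an exact search: if minShared is set too high, the true best match can be dropped, so the threshold needs tuning against your data.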

Fabian Steeg
+3  A: 

You won't be able to get better than O(n) performance with a standard Map - just use the naive approach of testing them sequentially.

There are far more efficient ways to do this, though. One of them is called a BK-tree. Basically, you construct an n-way tree in which each edge is labelled with the Levenshtein distance between the parent and child nodes. Then, you can make use of the triangle inequality to massively cut down the number of nodes you have to search. For small distance thresholds, it's very efficient. Here's a blog article I wrote some time ago, describing it in detail. With a little extra work, you can query it for the nearest neighbour directly, rather than repeatedly querying with distance 1, 2, etc.
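
A minimal sketch of the idea, assuming plain Levenshtein distance as the metric and a fixed-threshold range query rather than the nearest-neighbour variant. The class and method names are illustrative, and this is not the code from the blog article:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BkTree {
    private static final class Node {
        final String term;
        // edge label = distance between this node's term and the child's term
        final Map<Integer, Node> children = new HashMap<>();
        Node(String term) { this.term = term; }
    }

    private Node root;

    /** Two-row dynamic-programming Levenshtein distance. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    void add(String term) {
        if (root == null) { root = new Node(term); return; }
        Node node = root;
        while (true) {
            int d = distance(term, node.term);
            if (d == 0) return; // already stored
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(term)); return; }
            node = child; // descend along the edge labelled d
        }
    }

    /** Every stored term within maxDistance of query. */
    List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        if (root == null) return hits;
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            int d = distance(query, node.term);
            if (d <= maxDistance) hits.add(node.term);
            // Triangle inequality: a subtree under edge label e can only hold
            // a match if |e - d| <= maxDistance, so all other edges are skipped.
            for (Map.Entry<Integer, Node> edge : node.children.entrySet()) {
                if (Math.abs(edge.getKey() - d) <= maxDistance) {
                    stack.push(edge.getValue());
                }
            }
        }
        return hits;
    }
}
```

To use it for the original problem, add every map key to the tree once, then call search(source, maxDistance) and pick the hit with the best ratio. The pruning only pays off when maxDistance is small relative to the string lengths; with a large threshold almost every edge passes the check and the search degrades towards a full scan.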

Nick Johnson
blog link doesn't work Nick!
Will
Fixed now, thanks.
Nick Johnson