I'm looking for a high-performance Java library for fuzzy string search.

There are numerous algorithms for finding similar strings: Levenshtein distance, Daitch-Mokotoff Soundex, n-grams, etc.

What Java implementations exist, and what are their pros and cons? I'm aware of Lucene. Is there any other solution, or is Lucene the best?

I found these. Does anyone have experience with them?
SimMetrics
NGramJ

A: 

Lucene is the only way, I think. I don't know of any better search library.

Vugluskr
+4  A: 

Commons Lang has an implementation of Levenshtein distance.

Commons Codec has an implementation of soundex and metaphone.
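
For reference, here is a minimal sketch of what a Levenshtein distance method computes. It is plain JDK code, not the Commons Lang source, and the class and method names (`EditDistance`, `levenshtein`) are made up for illustration. It uses the usual two-row dynamic programming table, so memory stays proportional to the length of the second string.

```java
final class EditDistance {
    private EditDistance() {}

    /**
     * Levenshtein distance: the minimum number of single-character
     * insertions, deletions, and substitutions needed to turn a into b.
     * Hypothetical helper for illustration, not a library API.
     */
    static int levenshtein(String a, String b) {
        // prev[j] = distance between the first i-1 chars of a and the first j chars of b
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Turning the empty string into the first j chars of b takes j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i chars of a yields the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the two arrays instead of allocating a new row each time.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Calling `EditDistance.levenshtein("kitten", "sitting")` gives 3. The cost is O(n*m) per comparison, so comparing one candidate against a very large list this way can get slow.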

JodaStephen
+4  A: 

SimMetrics is probably what you need: http://www.dcs.shef.ac.uk/~sam/simmetrics.html

It has several algorithms for calculating various flavours of edit-distance.

Lucene is a very powerful full-text search engine, but full-text search isn't exactly the same thing as fuzzy string matching (e.g., given a list of strings, find the one that is most similar to some candidate string).
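
To make the "find the most similar string in a list" case concrete, here is a rough sketch in plain Java. It does not use the SimMetrics API. As a stand-in metric it scores strings by the Dice coefficient over character bigrams (one of the n-gram approaches mentioned in the question), and the names `BestMatch`, `dice`, and `closest` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class BestMatch {
    private BestMatch() {}

    /** Counts the character bigrams of s, ignoring case. */
    static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        String t = s.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 2 <= t.length(); i++) {
            counts.merge(t.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    /** Dice coefficient: 2 * shared bigrams / total bigrams, in the range 0..1. */
    static double dice(String a, String b) {
        Map<String, Integer> x = bigrams(a);
        Map<String, Integer> y = bigrams(b);
        int total = 0;
        for (int c : x.values()) total += c;
        for (int c : y.values()) total += c;
        if (total == 0) {
            // Both strings are shorter than two characters: fall back to equality.
            return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
        }
        int shared = 0;
        for (Map.Entry<String, Integer> e : x.entrySet()) {
            shared += Math.min(e.getValue(), y.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * shared / total;
    }

    /** Returns the pool entry most similar to candidate, or null if the pool is empty. */
    static String closest(String candidate, List<String> pool) {
        String best = null;
        double bestScore = -1.0;
        for (String s : pool) {
            double score = dice(candidate, s);
            if (score > bestScore) {
                bestScore = score;
                best = s;
            }
        }
        return best;
    }
}
```

This is a linear scan over the whole list, which is fine for small lists. For large ones you would want some kind of index, and that is the point where a search engine like Lucene starts to make sense.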

Darren
A: 

You can try bitap. I was playing with a bitap implementation written in ANSI C and it was pretty fast. There is a Java implementation at http://www.crosswire.org.
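
To give an idea of how bitap works, here is a small sketch in Java. It is neither the C code I played with nor the crosswire.org implementation, just my own illustration of the Wu-Manber variant that allows up to k edits (insertions, deletions, substitutions). The names `Bitap` and `fuzzyEndIndex` are made up, and I assume the pattern fits in a 64-bit `long`.

```java
import java.util.HashMap;
import java.util.Map;

final class Bitap {
    private Bitap() {}

    /**
     * Returns the index in text where the first approximate occurrence of
     * pattern ENDS (at most k edits), or -1 if there is none.
     * Illustrative sketch; the pattern is limited to 64 characters.
     */
    static int fuzzyEndIndex(String text, String pattern, int k) {
        int m = pattern.length();
        if (m == 0 || m > 64) {
            throw new IllegalArgumentException("pattern length must be 1..64");
        }
        if (k < 0 || k >= m) {
            throw new IllegalArgumentException("k must be in 0..pattern.length()-1");
        }

        // masks.get(c) has bit i set when pattern.charAt(i) == c.
        Map<Character, Long> masks = new HashMap<>();
        for (int i = 0; i < m; i++) {
            masks.merge(pattern.charAt(i), 1L << i, (a, b) -> a | b);
        }

        // r[d] has bit i set when pattern[0..i] matches a suffix of the text
        // read so far with at most d edits.
        long[] r = new long[k + 1];
        for (int d = 1; d <= k; d++) {
            r[d] = (1L << d) - 1; // the first d pattern chars may simply be deleted
        }
        long accept = 1L << (m - 1);

        for (int pos = 0; pos < text.length(); pos++) {
            long mask = masks.getOrDefault(text.charAt(pos), 0L);
            long oldPrev = r[0]; // value of r[d-1] before this text character
            r[0] = ((r[0] << 1) | 1L) & mask;
            for (int d = 1; d <= k; d++) {
                long old = r[d];
                r[d] = (((old << 1) | 1L) & mask)  // characters match
                     | oldPrev                     // extra character in the text
                     | ((oldPrev << 1) | 1L)       // substituted character
                     | ((r[d - 1] << 1) | 1L);     // pattern character missing from the text
                oldPrev = old;
            }
            if ((r[k] & accept) != 0) {
                return pos;
            }
        }
        return -1;
    }
}
```

Each text character costs only a handful of shifts and ORs per allowed error, which is why it is fast for short patterns. The downside is the pattern length limit imposed by the word size.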

Mojo Risin