views:

65

answers:

3

I have a list of people that I'd like to search through. I need to know 'how much' each item matches the string it is being tested against.

The list is rather small, currently 100+ names, and it probably won't reach 1000 anytime soon.
Therefore I assumed it would be OK to keep the whole list in memory and do the searching using something Java offers out-of-the-box or using some tiny library that just implements one or two testing algorithms. (In other words without bringing-in any complicated/overkill solution that stores indexes or relies on a database.)

What would be your choice in such case please?

EDIT: Seems like Levenshtein has closest to what I need from what has been adviced. Only that gets easily fooled when the search query is "John" and the names in list are significantly longer.

+1  A: 

If you are looking for a 'how much' match, you should use Soundex. Here is a Java implementation of this algorithm.

Vijay Mathew
Thanks Vijay, I will look into it. Hopefully it copes well with non-english names.
Jaroslav Záruba
There are also other options. http://stackoverflow.com/questions/42013/levenshtein-distance-based-methods-vs-soundex
Chris Nava
I would not use soundex even for English names. If you check the algorithm it uses (wikipedia has nice description) you will see that it is very far from being even close to perfect.
spbfox
Soundex has multiple issues. Probably not suited to your data. See my response below.
Mikos
+1  A: 

Check out Double Metaphone, an improved soundex from 1990.

http://commons.apache.org/codec/userguide.html

http://svn.apache.org/viewvc/commons/proper/codec/trunk/src/java/org/apache/commons/codec/language/DoubleMetaphone.java?view=markup

Adriaan Koster
This does not seem to return the relevance. Maybe I should have asked for comparing two strings...?
Jaroslav Záruba
+1  A: 

You should look at various string comparison algorithms and see which one suits your data best. Options are Jaro-Winkler, Smith-Waterman etc. Look up SimMetrics - a F/OSS library that offers a very comprehensive set of string comparison algorithms.

Mikos
Well, I have already dusted-off my own distance computing function that dates back to my PHP-times. But I'm definitely gonna look into this. Thanks mikOS.
Jaroslav Záruba