tags:

views:

194

answers:

9

I'd like to know if there is any class in Java able to check, using its own criteria, how much a String is equal to another one. Example :

  • William Shakespeare / William Shakespeare : might be 100%
  • William Shakespe*a*re / William Shakespe*e*re : might have above 90%
  • William Shakespeare / Shakespeare, William : might have above 70% (just examples)
+2  A: 

Generally, there is the levenshtein algorithm, which just outputs how many insert/update/delete operations you would have to perform (characterwise) to transform one string into another. Apache's StringUtils class has an implementation.

soulmerge
AFAIK levenshtein does not take the "sound" of a word into account.
DR
+1  A: 

You can use: Class Soundex

codaddict
+5  A: 

You have to use a "soft" string metric:

There are many others, see String Metrics for an overview.

The best algorith highly depends on the problem field. For example, SoundEx degrades for Eastern European names and the Hamming distance does not help you much if you want to compare the similiarity of "real world" words.

DR
+1  A: 

This is called SoundEx, lookup java soundex for several implementations.

one of them is apache soundex which looks good (although I haven't used it myself).

Omry
+1  A: 

Sounds like SoundEx, an implementation is available in Apache Commons.

Anders Lindahl
+1  A: 

You can try a SoundEx algorithm.

Hans Kesting
+7  A: 

I see two main candidates:

  • The Soundex encoding, implemented by Apache Commons. However, note that it's mainly meant for single, relatively short words. It won't find a similarity in your third example. Additionally, it really only works for English words.
  • The Levenshtein distance (Again implemented at Apache Commons). This is language agnostic, but similarity for switched parts as in your third example will be relatively low (more like 40%). Modifications like the Damerau–Levenshtein distance may yield better results.
Michael Borgwardt
(+1) The `Metaphone` and `DoubleMetaphone` algorithms in Commons Codec give better results that SOUNDEX, in my experience.
skaffman
A: 

String matching is very problem-specific, because most of the time you will have the same characteristics of noise in your strings to be matched, be it extra punctuation, typos or spelling errors. You will need to find an algorithm that is appropriate for the problems in your input data if you are doing this on a wide scale.

Soundex will give you a degree of confidence that two strings sound the same, but you may have to do some upfront cleaning first (like removing punctuation and tokenizing the string into separate words).

The best thing you can do is to run a test, there are an enormous amount of different algorithms you can use, levenshtein being a great one, as is soundex (although your mileage will vary with your problem area). There are also variations on those two algorithms, BTW.

I suggest having a look at the simmetrics and second string libraries which have loads of string matching implementations (of the two I prefer the second string library).

It sounds like you have an interesting problem to solve, good luck!

James B
A: 

try SimMetrics- open source library including SoundEx and ChapmanMatchingSoundex which would give a far better score for the examples given. i.e. Will Shake vs Shake, Will this approach uses a matching approach on-top of SoundEx. Another metric you may want to try which although not phonetic scores very well regardless (if not better in differing name matching tasks) is the q-Grams metric in the same library.

Sam Chapman