views:

2248

answers:

6

I want to compare several strings to each other, and find the ones that are the most similar. I was wondering if there is any library, method or best practice that would return me which strings are more similar to other strings. For example:

  • "The quick fox jumped" -> "The fox jumped"
  • "The quick fox jumped" -> "The fox"

This comparison would return that the first is more similar than the second.

I guess I need some method such as:

double similarityIndex(String s1, String s2)

Is there such a thing somewhere?

EDIT: Why am I doing this? I am writing a script that compares the output of a MS Project file to the output of some legacy system that handles tasks. Because the legacy system has a very limited field width, when the values are added the descriptions are abbreviated. I want some semi-automated way to find which entries from MS Project are similar to the entries on the system so I can get the generated keys. It has drawbacks, as it has to be still manually checked, but it would save a lot of work

+4  A: 

You could use Levenshtein distance to calculate the difference between two strings. http://en.wikipedia.org/wiki/Levenshtein_distance

Florian
Levenshtein is great for a few strings, but will not scale to comparisons between a large number of strings.
spender
I've used Levenshtein in Java with some success. I havent done comparisons over huge lists so there may be a performance hit.Also it's a bit simple and could use some tweaking to raise the threshold for shorter words (like 3 or 4 chars) which tend to be seen as more similar than the should (it's only 3 edits from cat to dog)Note that the Edit Distances suggested below are pretty much the same thing - Levenshtein is a particular implementation of edit distances.
Rhubarb
+1  A: 

Theoretically, you can compare edit distances.

Anton Gogolev
+8  A: 

yes, there are many well documented algorithms like:

  • cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity
  • etc etc

alternatively you can check this

check also these probjects:

dfa
+2  A: 

This is typically done using an edit distance measure. Searching for "edit distance java" turns up a number of libraries, like this one.

Laurence Gonsalves
+1  A: 

Sounds like a plagiarism finder to me if your string turns into a document. Maybe searching with that term will turn up something good.

"Programming Collective Intelligence" has a chapter on determining whether two documents are similar. The code is in Python, but it's clean and easy to port.

duffymo