I'm writing a piece of Java software that has to make the final judgment on the similarity of two documents encoded in UTF-8.

The two documents are very likely to be the same, or only slightly different from each other, because they have many features in common like date, location, creator, etc., but it is their text that ultimately decides whether they really are the same.

I expect the text of the two documents to be either very similar or not similar at all, so I can be rather strict about the similarity threshold. For example, I could say that the two documents are similar only if they have 90% of their words in common, but I would like something more robust that works for short and long texts alike.

To sum it up, I have:

  • two documents that are either very similar or not similar at all, but
  • it is more likely for the two documents to be similar than not
  • documents can be both long (a few paragraphs) and short (a few sentences)

I've experimented with SimMetrics, which has a large array of string matching functions, but I'm most interested in suggestions about possible algorithms to use.

Possible candidates I have are:

  • Levenshtein: its output is more significant for short texts
  • overlap coefficient: maybe, but will it discriminate well between documents of different lengths?

Also, considering two texts similar only when they are exactly the same would not work well, because I'd like documents that differ only by a few words to pass the similarity test.

Thanks for your time.

Silvio

+1  A: 

Levenshtein distance is the standard measure for a reason: it's easy to compute and its meaning is easy to grasp. If you are wary of the number of characters in a long document, you can compute it over words, sentences, or even paragraphs instead of characters. Since you expect the similar pairs to be very similar, that should still work well.
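For illustration, here is a minimal Java sketch of that idea: the usual two-row dynamic programming for Levenshtein distance, run over word tokens instead of characters. The whitespace tokenizer is an assumption; a real version might also normalize case and punctuation.

    import java.util.Arrays;
    import java.util.List;

    public class WordLevenshtein {

        // Levenshtein distance over token sequences (two-row DP).
        static int distance(List<String> a, List<String> b) {
            int[] prev = new int[b.size() + 1];
            int[] curr = new int[b.size() + 1];
            for (int j = 0; j <= b.size(); j++) prev[j] = j;
            for (int i = 1; i <= a.size(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.size(); j++) {
                    int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.size()];
        }

        // Convenience wrapper: tokenize each document on whitespace.
        static int wordDistance(String docA, String docB) {
            return distance(Arrays.asList(docA.trim().split("\\s+")),
                            Arrays.asList(docB.trim().split("\\s+")));
        }
    }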

Kilian Foth
Yes, given the few differences that occur between similar documents, this could work. Also, when there are many differences, I could return false early once the Levenshtein distance grows too large.
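A rough sketch of that early exit (the maxDist parameter and method name are mine, for illustration): because the minimum value in each row of the Levenshtein DP never decreases from one row to the next, the computation can stop as soon as an entire row exceeds the bound.

    public class BoundedLevenshtein {

        // Character-level Levenshtein that gives up once the distance
        // provably exceeds maxDist; returns maxDist + 1 in that case.
        static int distanceWithin(String a, String b, int maxDist) {
            if (Math.abs(a.length() - b.length()) > maxDist) return maxDist + 1;
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                int rowMin = curr[0];
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                    rowMin = Math.min(rowMin, curr[j]);
                }
                if (rowMin > maxDist) return maxDist + 1; // early exit
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }
    }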
Silvio Donnini
+1  A: 

Levenshtein seems to be the best solution here. If you are trying to get a weighted similarity ranking, which I guess is the case because you mentioned that the output of Levenshtein is more significant for shorter texts, then just weight the result of the Levenshtein algorithm by dividing it by the number of characters in the document.
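A minimal sketch of that weighting. Dividing by the longer of the two lengths is an assumption on my part, so the score always stays in [0, 1]; the answer only says "the number of characters in the document".

    public class NormalizedLevenshtein {

        // Plain character-level Levenshtein distance (two-row DP).
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Similarity in [0, 1]: 1.0 means identical documents.
        static double similarity(String a, String b) {
            int maxLen = Math.max(a.length(), b.length());
            if (maxLen == 0) return 1.0; // both documents empty
            return 1.0 - (double) levenshtein(a, b) / maxLen;
        }
    }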

Jannik Jochem
+3  A: 

Levenshtein is appropriate for the edit distance between two words; if you are comparing documents, something like diff will probably be more along the lines of what you need.

I would start here: http://c2.com/cgi/wiki?DiffAlgorithm. They provide links to a number of diff-style algorithms you can look into.
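As a self-contained illustration of the diff-style direction, here is a sketch that scores similarity from the longest common subsequence of lines, which is the structure classic diff algorithms compute. Splitting on lines is an assumption; words would work the same way.

    import java.util.Arrays;
    import java.util.List;

    public class DiffSimilarity {

        // Length of the longest common subsequence (two-row DP).
        static int lcsLength(List<String> a, List<String> b) {
            int[] prev = new int[b.size() + 1];
            int[] curr = new int[b.size() + 1];
            for (int i = 1; i <= a.size(); i++) {
                for (int j = 1; j <= b.size(); j++) {
                    curr[j] = a.get(i - 1).equals(b.get(j - 1))
                            ? prev[j - 1] + 1
                            : Math.max(prev[j], curr[j - 1]);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
                Arrays.fill(curr, 0);
            }
            return prev[b.size()];
        }

        // Dice-style score in [0, 1]: the fraction of lines the documents share.
        static double similarity(String docA, String docB) {
            List<String> a = Arrays.asList(docA.split("\n"));
            List<String> b = Arrays.asList(docB.split("\n"));
            return 2.0 * lcsLength(a, b) / (a.size() + b.size());
        }
    }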

danben
I agree with this. Levenshtein is horrible to use on whole documents. It's especially bad if you have N documents and you want to do pairwise similarity measurement, which suggests something more like a nearest-neighbor search.
polygenelubricants
Maybe diff to find the areas that changed and then Levenshtein between those areas? As a bonus, then, you're only running the fiddly algorithm against small areas of the documents.
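A rough sketch of that hybrid: align the lines that match exactly on both sides via the longest common subsequence (the lines diff would keep unchanged), then run character-level Levenshtein only on the changed regions between them. The levenshtein call stands for the plain character-level DP from the sketches above.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class HybridDistance {

        // Total character edit distance, summed only over the changed
        // regions between lines that match exactly on both sides.
        static int totalEditDistance(List<String> a, List<String> b) {
            List<int[]> anchors = matchedLines(a, b);
            anchors.add(new int[]{a.size(), b.size()}); // sentinel past the end
            int prevI = 0, prevJ = 0, total = 0;
            for (int[] p : anchors) {
                String gapA = String.join("\n", a.subList(prevI, p[0]));
                String gapB = String.join("\n", b.subList(prevJ, p[1]));
                total += levenshtein(gapA, gapB); // small inputs, cheap to run
                prevI = p[0] + 1;
                prevJ = p[1] + 1;
            }
            return total;
        }

        // Index pairs of the lines an LCS alignment keeps unchanged.
        static List<int[]> matchedLines(List<String> a, List<String> b) {
            int n = a.size(), m = b.size();
            int[][] dp = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++)
                for (int j = 1; j <= m; j++)
                    dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                             ? dp[i - 1][j - 1] + 1
                             : Math.max(dp[i - 1][j], dp[i][j - 1]);
            List<int[]> pairs = new ArrayList<>();
            int i = n, j = m;
            while (i > 0 && j > 0) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    pairs.add(new int[]{i - 1, j - 1}); i--; j--;
                } else if (dp[i - 1][j] >= dp[i][j - 1]) {
                    i--;
                } else {
                    j--;
                }
            }
            Collections.reverse(pairs);
            return pairs;
        }
    }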
Chamelaeon
@Chamelaeon: that sounds interesting, I like your suggestion
Silvio Donnini
A: 

@Silvio Donnini, I'm interested in the subject too. Care to share an update? Thanks.

Don Don
In my (very) particular case, I found that Levenshtein alone works pretty well, better than any other similarity function or combination of functions.
Silvio Donnini