I'm writing a piece of Java software that has to make the final judgement on the similarity of two documents encoded in UTF-8.
The two documents are very likely to be the same, or only slightly different, because they share many features such as date, location, creator, etc. — but it's their text that ultimately decides whether they really match.
I expect the texts of the two documents to be either very similar or not similar at all, so I can be rather strict about the similarity threshold. For example, I could say the two documents are similar only if they have 90% of their words in common, but I would like something more robust that works for short and long texts alike.
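To make the "90% of words in common" idea concrete, here is a minimal sketch of what I mean (plain Java, no library; the simple `\W+` tokenizer is just a placeholder and ignores Unicode word boundaries):

```java
import java.util.HashSet;
import java.util.Set;

public class WordOverlap {

    // Jaccard similarity over lowercased word sets: |A ∩ B| / |A ∪ B|.
    public static double jaccard(String a, String b) {
        Set<String> wa = tokenize(a);
        Set<String> wb = tokenize(b);
        if (wa.isEmpty() && wb.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return (double) inter.size() / union.size();
    }

    // Naive tokenizer: split on non-word characters, lowercase everything.
    private static Set<String> tokenize(String s) {
        Set<String> words = new HashSet<>();
        for (String w : s.toLowerCase().split("\\W+")) {
            if (!w.isEmpty()) words.add(w);
        }
        return words;
    }
}
```

With this, "similar" would mean something like `jaccard(a, b) >= 0.9` — my worry is exactly whether such a set-based measure stays meaningful across very different text lengths.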
To sum it up I have:
- two documents, either very similar or not similar at all, but:
- it is more likely for the two documents to be similar than not
- documents can be both long (some paragraphs) and short (a few sentences)
I've experimented with SimMetrics, which offers a large array of string-matching functions, but I'm mostly interested in suggestions about which algorithms to use.
Possible candidates I have are:
- Levenshtein: its output is more meaningful for short texts
- overlap coefficient: maybe, but will it discriminate well for documents of different length?
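For reference, this is the kind of Levenshtein-based measure I have in mind — the standard two-row dynamic-programming distance, normalized by the longer string's length so it lands in [0, 1] (a sketch of my own, not SimMetrics code):

```java
public class EditSimilarity {

    // Levenshtein distance via the standard two-row DP (O(|a|·|b|) time, O(|b|) space).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalize to [0, 1]: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }
}
```

My concern with this is cost: the DP is quadratic in the text lengths, which may be fine for a few paragraphs but feels wasteful when the texts are obviously unrelated.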
Also, considering two texts similar only when they are exactly identical would not work, because I'd like documents that differ by only a few words to still pass the similarity test.
thanks for your time
Silvio