views:

163

answers:

5

I have a project where I need to compare multi-chapter documents to a second document to determine their similarity. The issue is I have no idea how to go about doing this, what approaches exist, or if there are any libraries available.

My first question is... what is similar? The number of words that match? The number of consecutive words that match?

I could see writing a parser that puts each document into an array of words with their locations and then comparing them.

I saw the earlier question at http://stackoverflow.com/questions/220187/algorithms-or-libraries-for-textual-analysis-specifically-dominant-words-phras

however, it seems somewhat different than what I am attempting to do.

Any options or pointers people may have would be great!

A: 

The diff tools used by all source control systems do almost exactly this. Try one of them to help you measure the number of differences (and hence how similar the documents are).
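As a hedged sketch of this idea: Python's standard-library difflib module implements a diff-style comparison and exposes a similarity ratio directly, so you can get a rough similarity score without writing a diff yourself.

```python
import difflib

def diff_similarity(text_a, text_b):
    """Similarity in [0, 1] based on longest matching blocks of words,
    the same kind of matching a diff tool performs."""
    matcher = difflib.SequenceMatcher(None, text_a.split(), text_b.split())
    return matcher.ratio()

# Identical word sequences score 1.0; disjoint ones score 0.0.
print(diff_similarity("a b c d", "a b c d"))  # 1.0
print(diff_similarity("a b", "x y"))          # 0.0
```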

rikh
+1  A: 

"What is similar?" We can't tell you that; it is a statement of a fundamental requirement of your project. If you don't know this, it's a bit soon to think about how to do it.

It may be helpful to ask the question "why". What will the measure of similarity be used for?

If, for example, the purpose is to detect plagiarism, then detecting that two essays are similar because they talk about the same subjects and make similar references is not likely to be helpful: the entire class would submit similar essays! So there you might be looking for matching exact sentences and phrases.

If instead you are trying to build a catalogue for some documents, then perhaps you would search out keywords. Two documents are similar if they use the same vocabulary of words over a certain length, or similar proper nouns.

Those two examples are intended to demonstrate that until we understand what is meant by similar it is hard to give much advice.

However, here's a possible approach. You could write two main things: an Extractor and a Comparator.

The extractor's job is to munge through the document and produce the set (or list; does it need to be ordered?) of chunks that are the essence of the document: these might be individual words, or sentences and phrases.

The comparator's job is to evaluate the similarity of the two documents' essences.

Simple example: extract the unique set of words of 8 letters or more from each document. The comparison could then be: two documents are similar if one's set contains more than 75% of the other's.
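That simple example could be sketched like this; the function names and the 75% threshold are illustrative, not part of any library:

```python
import re

def extract(text, min_len=8):
    """Extractor: the set of lowercased words with at least min_len letters."""
    return {w for w in re.findall(r"[a-zA-Z]+", text.lower())
            if len(w) >= min_len}

def similar(doc_a, doc_b, threshold=0.75):
    """Comparator: True if the smaller word set is covered at least
    `threshold` by the overlap with the other set."""
    a, b = extract(doc_a), extract(doc_b)
    if not a or not b:
        return False
    return len(a & b) / min(len(a), len(b)) >= threshold

doc1 = "The algorithm performs tokenization and comparison efficiently"
doc2 = "tokenization and comparison of the algorithm"
print(similar(doc1, doc2))  # True
```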

djna
A: 

It depends on what you want to achieve. If the goal is to find documents similar to a given document in a set of documents, you could try something like this:

Depending on the document, you could first extract the most meaningful keywords or key sentences out of the long documents to capture the essence of the text (google "keyword extraction"). Then you can work with text similarity algorithms (like the k-nearest-neighbor algorithm) to fish out similar documents. The essential step is extracting the key parts of the text.
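A minimal sketch of the comparison step, assuming keyword extraction has already happened: represent each document by its word-frequency vector and compare with cosine similarity. A real pipeline would typically use TF-IDF weights over extracted keywords; plain word counts are used here to keep the example self-contained.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity of the two documents' word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Documents with identical word distributions score 1.0; documents sharing no words score 0.0.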

Matt
+1  A: 

One simple approach is to concatenate the two documents' text and compress the result. The compression ratio, compared to compressing each document separately, can tell you roughly how much similarity you have.
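This idea is known as normalized compression distance; here is a sketch using Python's zlib. If compressing the concatenation is barely larger than compressing one document alone, the texts share a lot of structure. Lower values mean more similar.

```python
import zlib

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance; lower means more similar."""
    ca = len(zlib.compress(a))
    cb = len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)
```

Note that general-purpose compressors like zlib have a limited window, so for truly book-length documents you may need to compare chunk by chunk.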

David Plumpton
+1  A: 

One approach that you can use is called Shingling. The process involves tokenizing all the words in both documents, e.g.:

D1 = {"An", "Example", "Document", "To", "Show", "Shingling"}
D2 = {"Another", "Example", "Document", "To", "Show", "Shingling", "but", "longer"}

Then take the set of contiguous sub-sequences of window length n (remember, a set contains no duplicates):

S(D1, 3) = {{"An", "Example", "Document"}, {"Example", "Document", "To"}, {"Document", "To", "Show"}, {"To", "Show", "Shingling"}}

S(D2, 3) = {{"Another", "Example", "Document"}, {"Example", "Document", "To"}, {"Document", "To", "Show"}, {"To", "Show", "Shingling"}, {"Show", "Shingling", "but"}, {"Shingling", "but", "longer"}}

Then the resemblance is the cardinality of the intersection divided by the cardinality of the union. For this example, that is 3/7 ≈ 43% similar.
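The example above, computed directly: shingles as sets of word tuples, resemblance as the Jaccard similarity of the two sets.

```python
def shingles(words, n):
    """Set of contiguous word n-grams (shingles) from a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def resemblance(words_a, words_b, n):
    """Jaccard similarity of the two documents' shingle sets."""
    sa, sb = shingles(words_a, n), shingles(words_b, n)
    return len(sa & sb) / len(sa | sb)

d1 = ["An", "Example", "Document", "To", "Show", "Shingling"]
d2 = ["Another", "Example", "Document", "To", "Show", "Shingling",
      "but", "longer"]
# 3 shared shingles out of 7 distinct ones
print(resemblance(d1, d2, 3))  # 0.42857142857142855
```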

An efficient approximation can be made by using sketches: randomly selected subsets of the set of shingles.
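One common way to build such sketches is MinHash (not named in the answer, so this is an assumption on my part): for each of k hash functions, keep the shingle with the smallest hash value; the fraction of signature positions that agree estimates the Jaccard resemblance. A rough illustrative sketch, not a tuned implementation:

```python
import hashlib

def minhash_signature(shingle_set, k=64):
    """k-value MinHash signature of a set of string shingles.
    Each seed simulates an independent hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(k)
    ]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature positions that agree; this approximates
    the Jaccard similarity of the underlying shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The point is that comparing two fixed-length signatures is much cheaper than intersecting the full shingle sets of long documents.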

Robert