views:

93

answers:

4

What algorithm would you suggest to measure, as a float from 0 to 1, how identical two texts are?

Note that I don't mean similar (i.e., they say the same thing but in a different way); I mean the exact same words, where one of the two texts could have extra words, slightly different words, extra new lines, and things like that.

A good example of the algorithm I want is the one Google uses to identify duplicate content in websites ("X search results very similar to the ones shown have been omitted, click here to see them").

The reason I need it is that my website lets users post comments; similar but different pages currently have their own comments, so many users ended up copy-pasting their comments onto all the similar pages. Now I want to merge them (all similar pages will "share" the comments, so a comment posted on page A will also appear on similar page B), and I would like to programmatically erase all those copy-pasted comments from the same user.

I have quite a few million comments, but speed shouldn't be an issue since this is a one-time thing that will run in the background.

The programming language doesn't really matter (as long as it can interface to a MySQL database), but I was thinking of doing it in C++.

+2  A: 

Would the Longest Common Subsequence algorithm fill the bill? It's basically what diff uses. There's a dynamic programming algorithm that allows you to solve such problems efficiently. The Wikipedia page I linked to has all the information you need.
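As a concrete illustration, here is a minimal sketch of the dynamic-programming LCS computation, normalized to a 0-to-1 score as the question asks (the function name and the 2.0*M / (len(a)+len(b)) normalization are my own choices for this sketch):

```python
def lcs_ratio(a, b):
    """Return a 0..1 similarity score based on the longest common subsequence."""
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Normalize: identical texts score 1.0, disjoint texts score 0.0.
    return 2.0 * dp[m][n] / (m + n) if (m + n) else 1.0
```

Note the table is O(len(a) * len(b)) in both time and space, which is fine for comment-sized texts but adds up quickly over millions of pairwise comparisons.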

To experiment with this in a nice and friendly way, you can use the Python difflib module, which implements a closely related matching algorithm. It contains a difflib.SequenceMatcher class with a ratio method, which will:

Return a measure of the sequences’ similarity as a float in the range [0, 1].

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.
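A quick sketch of what that looks like in practice (the two comment strings are made up for illustration):

```python
from difflib import SequenceMatcher

# Two example comments: the second has a few extra words inserted.
a = "Great article, thanks for sharing!"
b = "Great article, thanks a lot for sharing!"

# ratio() returns 2.0*M / T as described above: a float in [0, 1].
ratio = SequenceMatcher(None, a, b).ratio()
```

For a copy-pasted-with-tweaks comment like this, the ratio comes out close to 1.0, so a simple threshold (say, 0.9) could decide which duplicates to erase.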

Eli Bendersky
+3  A: 

Robust similarity comparisons, e.g. Levenshtein distance, are typically expensive. With many different texts to compare, you also run into the problem of an immense number of potential pairwise comparisons.

A more practical technique for your case would probably be Karp-Rabin fingerprinting.
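A minimal sketch of the fingerprinting idea: hash every k-character shingle of each text and compare the resulting fingerprint sets. The shingle size k=5, the Jaccard-index comparison, and the use of Python's built-in hash() in place of a true Rabin rolling hash are all simplifying assumptions for illustration:

```python
def fingerprints(text, k=5):
    """Hash every k-character shingle of the text into a set of fingerprints."""
    return {hash(text[i:i + k]) for i in range(len(text) - k + 1)}

def similarity(a, b, k=5):
    """Jaccard index of the two fingerprint sets, in [0, 1]."""
    fa, fb = fingerprints(a, k), fingerprints(b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)
```

The win over pairwise edit distance is that each text is fingerprinted once; fingerprint sets can then be indexed, so near-duplicates are found without comparing every pair of comments in full.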

Rob Lachlan
+1  A: 

Cosine Similarity

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°. - Wikipedia

EDIT:

SIMILAR but different pages currently have their own comments, so many users ended up copy and pasting their comments on all the SIMILAR pages.

This similarity can be exploited.

  1. Find similar Posts.
  2. Find users COMMON to those posts; ignore the others.

This grouping should reduce your task :)

TheMachineCharmer
+1  A: 

Cosine Similarity is a good measure. See chapters 6-7 of Introduction to Information Retrieval at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Michael Munsey