Suppose I have this table:

ID | description
-------------------
5  | The bird flew over the tree.
2  | The birds, flew over the tree

These two rows have "similar" content. How would I remove #2?

  1. What algorithm should I use for "similar" text?
  2. How would I do this with Python?

Thanks!

+3  A: 

Typically, for each value, you'd create a 'simplified' value by removing whatever isn't essential (in your example, the punctuation and pluralization), and then compare the simplified values for equality.
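As a rough sketch of that idea in Python: the `simplify` helper and its plural rule below are my own assumptions for illustration, not a standard algorithm, and the rule is deliberately crude (it would mangle words like "this").

```python
import string


def simplify(text):
    """Reduce a description to a crude comparison key (hypothetical helper)."""
    # Lowercase and drop ASCII punctuation.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = []
    for word in cleaned.split():
        # Naive plural handling, assumed for this sketch only:
        # strip one trailing "s" from words longer than three letters.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


rows = [
    (5, "The bird flew over the tree."),
    (2, "The birds, flew over the tree"),
]

seen = {}           # simplified key -> ID of the first row that produced it
duplicate_ids = []  # IDs of later rows whose key was already seen
for row_id, description in rows:
    key = simplify(description)
    if key in seen:
        duplicate_ids.append(row_id)
    else:
        seen[key] = row_id
```

With the two sample rows, both descriptions simplify to the same key, so row 2 ends up in `duplicate_ids` and row 5 is kept because it came first.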

Joe
+1  A: 

Look here for some inspiration.

fvu
A: 

You could use the LIKE operator.

DELETE FROM myTable WHERE description LIKE 'The bird%flew over the tree%';
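
One thing to watch: `%` matches any run of characters, including a single space, so this pattern matches row 5 as well as row 2. A small sketch using Python's built-in `sqlite3` module, with an in-memory table and the choice to keep ID 5 as assumptions for illustration:

```python
import sqlite3

# In-memory database standing in for the real table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (ID INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO myTable (ID, description) VALUES (?, ?)",
    [(5, "The bird flew over the tree."), (2, "The birds, flew over the tree")],
)

pattern = "The bird%flew over the tree%"

# Check what the pattern matches before deleting anything.
matching_ids = [
    row[0]
    for row in conn.execute(
        "SELECT ID FROM myTable WHERE description LIKE ? ORDER BY ID", (pattern,)
    )
]

# Delete every match except the row we want to keep.
keep_id = 5
conn.execute(
    "DELETE FROM myTable WHERE description LIKE ? AND ID <> ?", (pattern, keep_id)
)
conn.commit()

remaining_ids = [row[0] for row in conn.execute("SELECT ID FROM myTable")]
```

Running the `SELECT` first shows that both IDs match, which is why the `DELETE` here excludes the row to keep.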
Dan Dyer
+5  A: 

What you could try is stripping unnecessary punctuation and running each sentence through a stemmer (e.g. a Porter Stemmer).

Once you have a stemmed version of the sentence, you could store it in another column for comparison. However, if the sentences are long (e.g. over 40 characters on average), you may find it more space-efficient to store a hash of the stemmed sentence instead.

Any rows that share the same stemmed sentence or hash are very likely to be equivalent. You could automate their removal, or create a UI that lets a human rapidly approve each one.
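
A minimal sketch of the stem-then-hash idea, using only the standard library. `toy_stem` is a placeholder I made up so the example runs on its own; it is not the Porter algorithm. A real stemmer could be passed in through the `stem` parameter instead (for example, the `stem` method of NLTK's `PorterStemmer`, if NLTK is installed). The third row and the choice of SHA-1 are also assumptions.

```python
import hashlib
import string
from collections import defaultdict


def toy_stem(word):
    # Placeholder, NOT the Porter algorithm: drops one trailing "s".
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def stemmed_hash(text, stem=toy_stem):
    """Return a fixed-length hex digest of the stemmed, punctuation-free text."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    stemmed = " ".join(stem(word) for word in cleaned.split())
    return hashlib.sha1(stemmed.encode("utf-8")).hexdigest()


rows = [
    (5, "The bird flew over the tree."),
    (2, "The birds, flew over the tree"),
    (7, "A cat sat on the mat."),  # extra row added for illustration
]

# Group row IDs by the hash of their stemmed description.
groups = defaultdict(list)
for row_id, description in rows:
    groups[stemmed_hash(description)].append(row_id)

# Any group with more than one ID is a set of candidate duplicates.
candidates = [ids for ids in groups.values() if len(ids) > 1]
```

Here rows 5 and 2 land in the same group, so `candidates` holds that one pair for automatic removal or human review. The digest is always 40 hex characters, whatever the sentence length.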

Here's a Python implementation of the Porter stemmer.

Paul Dixon
Umm, why hash them?
n1313
Just to have a short "code" for the stemmed sentence to avoid too much overhead. Will modify answer to clarify, thanks for raising it.
Paul Dixon