ansaurus

Question

How to remove "similar" but not identical content in a MySQL database.

Answer 1

+3 A:

Typically, for each value, you'd create a 'simplified' value (remove whatever wasn't essential ... in your example, the punctuation and pluralization), and then compare the simplified values for equality.
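As a rough illustration of the "simplified value" idea, here is a minimal Python sketch. The function name `simplify`, the sample sentences, and the crude plural rule (drop a trailing "s") are all assumptions made for the example; real data would likely need more careful rules.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)

def simplify(text: str) -> str:
    """Return a comparison key: lowercase, no punctuation, naive singular forms."""
    words = text.lower().translate(_PUNCT).split()
    # Crude plural handling (an assumption, not real linguistics):
    # drop a trailing "s" from words longer than three letters.
    words = [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]
    return " ".join(words)

# Hypothetical sample values, not taken from the original question.
a = "The bird flew over the tree."
b = "The birds flew over the trees"

# Two values count as "similar" when their simplified keys are equal.
print(simplify(a) == simplify(b))
```

The simplified key could be computed in application code and stored next to the original value, so that the comparison itself stays a plain equality test.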

Joe 2009-10-04 12:28:06

Answer 2

+1 A:

Look here for some inspiration.

fvu 2009-10-04 12:49:34

Answer 3

A:

You could use the LIKE operator.

DELETE FROM myTable WHERE description LIKE 'The bird%flew over the tree%';
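Because a LIKE pattern can match more than intended, it may be worth previewing what it would hit before deleting anything. The sketch below approximates LIKE matching in Python against a hypothetical list of descriptions; `like_to_regex` and `preview_matches` are names invented for this example. It assumes a case-insensitive collation and ignores backslash escapes, so treat it as an approximation rather than MySQL's exact behaviour.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into a case-insensitive regex (approximation)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters, including none
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def preview_matches(pattern, descriptions):
    """Return the descriptions that the LIKE pattern would match in full."""
    rx = like_to_regex(pattern)
    return [d for d in descriptions if rx.fullmatch(d)]

# Hypothetical rows, made up for illustration.
descriptions = [
    "The bird flew over the tree.",
    "The birds flew over the trees",
    "The birdwatcher flew over the treetops in a helicopter",
    "A dog barked at the moon.",
]

for d in preview_matches("The bird%flew over the tree%", descriptions):
    print(d)
```

With these made-up rows the third sentence matches as well, which shows how a wildcard pattern can sweep up rows that are not really duplicates.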

Dan Dyer 2009-10-04 14:08:46

Answer 4

+5 A:

What you could try is stripping unnecessary punctuation and running each sentence through a stemmer (e.g. the Porter stemmer).

Once you have a stemmed version of the sentence, you could store it in another column for comparison. However, you may find it more space-efficient to store a hash of the stemmed sentence instead if the sentences are long (e.g. over 40 characters on average).

Any rows that share the same stemmed sentence or hash are very likely to be equivalent. You could automate their removal, or create a UI that lets a human rapidly approve each one.

Here's a Python implementation of the Porter stemmer.
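The stem-then-hash approach described above could look roughly like the following sketch. Note that `crude_stem` is only a stand-in that strips a few common suffixes; it is not the Porter algorithm, and a real stemmer should be swapped in for actual use. The function names, the sample rows, and the choice of MD5 (used purely as a compact key, not for security) are assumptions for the example.

```python
import hashlib
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)

def crude_stem(word: str) -> str:
    """Stand-in for a real stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def sentence_hash(sentence: str) -> str:
    """Hash the lowercased, punctuation-free, stemmed sentence."""
    words = sentence.lower().translate(_PUNCT).split()
    stemmed = " ".join(crude_stem(w) for w in words)
    return hashlib.md5(stemmed.encode("utf-8")).hexdigest()

def duplicate_ids(rows):
    """Given (id, sentence) pairs, return the ids of all but the first row per hash."""
    seen = {}
    extras = []
    for row_id, sentence in rows:
        key = sentence_hash(sentence)
        if key in seen:
            extras.append(row_id)
        else:
            seen[key] = row_id
    return extras

# Hypothetical rows, made up for illustration.
rows = [
    (1, "The bird flew over the tree."),
    (2, "The birds flew over the trees"),
    (3, "A dog barked at the moon."),
]

print(duplicate_ids(rows))
```

The returned ids are candidates only; they could feed an automated DELETE or a review screen, as suggested above.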

Paul Dixon 2009-10-04 14:17:25

Umm, why hash them?

n1313 2009-10-04 14:21:27

Just to have a short "code" for the stemmed sentence to avoid too much overhead. Will modify answer to clarify, thanks for raising it.

Paul Dixon 2009-10-04 14:27:02
