views:

73

answers:

5

Hay, I created a spider to crawl through a PDF document and log every word in the document into a table in a MySQL database.

Obviously words like 'the', 'and', 'or' etc appear in a book many, many times.

I'm just wondering what's the quickest method to remove dupe values from a table?

A: 

delete from words where idcolumn not in (select min(idcolumn) from words T2 where T2.plain = WordsTable.plain)

This works if you added (idcolumn, plain) for every word you found.

If you do not have an id column (pk) then you can use Anax's solution.

In addition to not inserting duplicates (codeburger comment), you can just set a unique index on your plain column.

pritaeas
the words table is called 'words' and the field containing the word is 'plain'
dotty
A: 

select distinct on word field, and then delete all rows that have a different id? I'm not a master in subqueries so no example atm :)

Nicky De Maeyer
+1  A: 

Instead of removing duplicates, you could make sure that no duplicates ever make it into the table.

Presuming your table has only 2 fields, id and word:

INSERT INTO table SELECT null, 'word' FROM table WHERE NOT EXISTS (SELECT * FROM table WHERE word = 'word') LIMIT 1;

This will insert the word into the table only if it's not already in there

code_burgar
+3  A: 

Create a table without indexing the words and put in all the words from the book using mass inserts (you could also use LOAD DATA). When you're done with insertions, add a new Index on the word field

Then create a second table using:

CREATE TABLE newTable SELECT DISTINCT word FROM oldTable
Anax
+1  A: 

If you can rerun the script to populate the database, you could add a unique key on the "word" field and instead of INSERT INTO do a REPLACE INTO. This will delete the previous instance of the record before adding a duplicate field. This may not be the most efficient way to do it, but it's rather simple. See here for more details:

http://dev.mysql.com/doc/refman/5.0/en/replace.html

Mike A.