I saw the solution that creates an alternate temporary MySQL table with unique rows, but I didn't like that idea: my tables are very large, so moving them would be a hassle (and would create huge problems if there were errors during the move).

I did, however, find the following. What do you think of this (where the column to check for duplicates is "field_name")?

DELETE FROM table1
USING table1, table1 as vtable
WHERE (NOT table1.ID=vtable.ID)
AND (table1.field_name=vtable.field_name)

Somebody said this should work, but I'm not quite sure. What do you think? Also, will having indexes at all alter the performance of this command, say, having an index on "field_name"?

Thanks a lot.

EDIT: Would there be any way to test the query before running it? As far as I know, MySQL doesn't support "explain" on DELETE queries.

A: 

That query should work. Having indexes will alter the performance but it really depends on the size of the table.

As for testing this out, I would copy a subset of the data to a temporary table and run the command on the temp table before you run it on your real table.

Remember to always back up tables before performing any major batch jobs, so you can roll back if something goes wrong.
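
A minimal sketch of such a backup done inside MySQL itself, assuming there is enough disk space for a second copy. The name table1_backup is made up for illustration; for very large tables a dump with mysqldump may be more practical.

```sql
-- Create an empty table with the same columns and indexes as table1.
-- table1_backup is a hypothetical name.
CREATE TABLE table1_backup LIKE table1;

-- Copy every row across.
INSERT INTO table1_backup SELECT * FROM table1;

-- Sanity check: both counts should match before table1 is modified.
SELECT (SELECT COUNT(*) FROM table1)        AS original_rows,
       (SELECT COUNT(*) FROM table1_backup) AS backup_rows;
```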

Michael Bazos
A: 

The method that I use avoids a JOIN condition and should be significantly faster:

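DELETE FROM table1 WHERE id NOT IN
  (SELECT keep_id FROM (SELECT MIN(x.id) AS keep_id FROM table1 AS x GROUP BY x.field_name) AS keepers);

The inner subselect gathers the list of ids that you want to keep, one row for each field_name. It is wrapped in a derived table because MySQL otherwise rejects a subquery that reads from the table being deleted from (error 1093). The DELETE statement then deletes all the extra duplicate rows.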

Also, yes, the index on the field_name field will improve performance in your query.
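
As a rough sketch, the index and a preview of what would be kept could look like this. The index name idx_field_name is hypothetical, and whether the index is actually used is something EXPLAIN has to confirm on your data.

```sql
-- idx_field_name is a hypothetical index name.
CREATE INDEX idx_field_name ON table1 (field_name);

-- For each duplicated field_name: the id that would be kept
-- and how many copies currently exist.
SELECT MIN(x.id) AS keep_id, x.field_name, COUNT(*) AS copies
FROM table1 AS x
GROUP BY x.field_name
HAVING COUNT(*) > 1;

-- Shows whether the GROUP BY reads the index or builds a temporary table.
EXPLAIN SELECT MIN(x.id) FROM table1 AS x GROUP BY x.field_name;
```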

thetaiko
Virtually any use of `GROUP BY` in MySQL invokes a temporary table, which seriously harms performance.
Bill Karwin
@Bill - just did a quick test on one of my test databases. Granted, only 30,000 rows, but the `GROUP BY` significantly outperforms the `JOIN` in the cases where I use it.
thetaiko
Well, that's good. I hope it also applies to the OP's database.
Bill Karwin
+1  A: 

Note that the query you show will delete both duplicates. I would assume you want to keep one or the other.

Here's how I would write this query:

DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 
  ON t1.id > t2.id AND t1.field_name = t2.field_name;

By using greater-than instead of not-equals-to, you only delete one row (the later one), instead of both.
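
To see how many rows the greater-than join would remove before deleting anything, a SELECT with the same join condition can be run first. This is only a sketch; on a very large table the self-join count may itself be slow.

```sql
-- Rows the greater-than join would delete: every row that has
-- another row with the same field_name and a lower id.
SELECT COUNT(DISTINCT t1.id) AS rows_to_delete
FROM table1 AS t1 JOIN table1 AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name;

-- Cross-check: total rows minus one survivor per distinct value.
-- NULLs are excluded because NULL = NULL never matches in the join,
-- so rows with a NULL field_name are never deleted.
SELECT COUNT(*) - COUNT(DISTINCT field_name) AS expected_deletes
FROM table1
WHERE field_name IS NOT NULL;
```

If the two numbers agree, the join is matching exactly the extra copies.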

A compound index over (id, field_name) may help. You should confirm this with MySQL's EXPLAIN, which reports the optimizer's plan. In MySQL 5.1, EXPLAIN only supports SELECT queries (EXPLAIN for DELETE was added in MySQL 5.6), so run an equivalent SELECT to confirm the optimization:

EXPLAIN SELECT * FROM table1 AS t1 JOIN table1 AS t2 
  ON t1.id > t2.id AND t1.field_name = t2.field_name;
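
A sketch of how the compound index might be tried out, preferably on a copy or sample if the table is large. The index names are hypothetical. The second index, with the equality column first, is an assumption worth comparing rather than a recommendation; EXPLAIN should show which one the optimizer picks.

```sql
-- idx_id_field and idx_field_id are hypothetical index names.
-- Column order as described above:
ALTER TABLE table1 ADD INDEX idx_id_field (id, field_name);
-- Alternative order, with the equality column first:
ALTER TABLE table1 ADD INDEX idx_field_id (field_name, id);

-- Re-run the EXPLAIN and look at the "key" column to see which index is chosen.
EXPLAIN SELECT * FROM table1 AS t1 JOIN table1 AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name;

-- Afterwards, drop whichever index EXPLAIN shows as unused, for example:
-- DROP INDEX idx_id_field ON table1;
```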

You also asked about testing. I'd recommend copying a sample of rows containing duplicates to a table in your test database:

CREATE TABLE test.table1test SELECT * FROM realdb.table1 LIMIT 10000;
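
LIMIT without ORDER BY returns an arbitrary set of rows, so the sample is not guaranteed to contain duplicates. A quick check, as a sketch:

```sql
-- List the most-duplicated values in the sample table.
SELECT field_name, COUNT(*) AS copies
FROM test.table1test
GROUP BY field_name
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 20;
```

If this returns an empty set, the sample has no duplicates to experiment on and a larger or different sample is needed.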

Now you can perform experiments on your sample data until you're satisfied the DELETE solution is correct.

USE test;
SET autocommit = 0;
DELETE ... 
ROLLBACK;

I'd recommend naming your scratch table in the test database something distinct from your real table in your real database, just in case you run an experimental DELETE while you are accidentally still using your real database as the default database!
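
One way the whole experiment might look against the scratch table, using the join DELETE from earlier. This is a sketch that assumes the table1test sample created above. Note that ROLLBACK only undoes changes on a transactional storage engine such as InnoDB; on a MyISAM table the DELETE is permanent, so the engine is switched first.

```sql
USE test;

-- Make sure the scratch table is transactional, otherwise ROLLBACK has no effect.
ALTER TABLE table1test ENGINE = InnoDB;

SET autocommit = 0;

SELECT COUNT(*) AS before_delete FROM table1test;

DELETE t1 FROM table1test AS t1 JOIN table1test AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name;

-- Should be lower than before_delete if the sample had duplicates.
SELECT COUNT(*) AS after_delete FROM table1test;

-- Discard the DELETE and return to the initial state.
ROLLBACK;

-- Should match before_delete again.
SELECT COUNT(*) AS after_rollback FROM table1test;
```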


Re your comments:

USE test is a mysql client builtin command. It sets the test database as the default database. This will be the default database when you name tables in your queries without qualifying them with a database name. See http://dev.mysql.com/doc/refman/5.1/en/use.html

SET autocommit = 0 turns off the default behavior of committing a transaction for each query implicitly. So you must explicitly give the COMMIT or ROLLBACK command to finish a transaction. See http://dev.mysql.com/doc/refman/5.1/en/commit.html

It's worthwhile to use ROLLBACK when you're experimenting because it discards the changes made in that transaction. It's a quick way to return to the initial state of your test data so you can try another experiment.

DELETE t1 is not a typo. DELETE deletes rows, not whole tables. t1 is an alias to each row that satisfies the conditions of the statement (although it is possible that the conditions include every row in the table). See description of multi-table delete at http://dev.mysql.com/doc/refman/5.1/en/delete.html

It's sort of like running a loop in PHP with a variable that iterates over the loop: for ($i=0; $i<100; ++$i) ... The variable $i takes on a series of values, a different one each time through the loop.

Here's a demo showing how my solution deletes multiple duplicates. I ran this in my test database and I'm pasting the result directly from my command window:

mysql> create table table1 (id serial primary key, field_name varchar(10));
Query OK, 0 rows affected (0.45 sec)

mysql> insert into table1 (field_name) 
       values (42), (42), (42), (42), (42), (42);
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> select * from table1;
+----+------------+
| id | field_name |
+----+------------+
|  1 | 42         | 
|  2 | 42         | 
|  3 | 42         | 
|  4 | 42         | 
|  5 | 42         | 
|  6 | 42         | 
+----+------------+
6 rows in set (0.00 sec)

mysql> delete t1 from table1 t1 join table1 t2 
       on t1.id > t2.id and t1.field_name = t2.field_name;
Query OK, 5 rows affected (0.00 sec)

mysql> select * from table1;
+----+------------+
| id | field_name |
+----+------------+
|  1 | 42         | 
+----+------------+
1 row in set (0.00 sec)
Bill Karwin
Thanks for the help Bill! You have gotten an upvote on your answer, so I'll just assume your answer is the best solution (not saying the others are bad). Would you mind explaining, please, what the block of code means that starts with "USE test; SET..."? Also, just to make sure, the greater-than sign's use will ensure that ALL duplicates will be deleted, even if there is more than 1 duplicate (say, 5)? Thanks a lot.
Also forgot to ask: Your first block of code with the solution isn't a typo, correct? You put "DELETE t1". Doesn't that mean it'll delete a whole table or something? Sorry for all the questions, this is just a bit complicated for me =)
Thanks a lot for the information!