Note that the query you show will delete both duplicates. I would assume you want to keep one or the other.
Here's how I would write this query:
DELETE t1 FROM table1 AS t1 JOIN table1 AS t2
ON t1.id > t2.id AND t1.field_name = t2.field_name;
By using greater-than instead of not-equals-to, you delete only the later rows and keep the one with the lowest id in each group of duplicates, instead of deleting all of them.
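Before deleting anything, it can help to list the rows the join would match. The following is a sketch of an equivalent SELECT using the same table and column names as the DELETE above. The DISTINCT and the second summary query are my additions: a row with several earlier duplicates matches more than once, and the summary shows which id would survive in each group.

```sql
-- Preview: rows the DELETE would remove, i.e. every row that has
-- an earlier row (lower id) with the same field_name.
SELECT DISTINCT t1.*
FROM table1 AS t1 JOIN table1 AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name
ORDER BY t1.id;

-- Survivors: the lowest id in each group of duplicates.
SELECT field_name, MIN(id) AS kept_id, COUNT(*) AS copies
FROM table1
GROUP BY field_name
HAVING COUNT(*) > 1;
```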
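A compound index may help. Since the join compares field_name for equality and id by range, an index over (field_name, id) is likely to be more useful than one over (id, field_name). You should confirm this with MySQL's EXPLAIN to get the optimization report. In MySQL 5.1, EXPLAIN only supports SELECT queries (MySQL 5.6.3 and later can also explain DELETE), so you should run an equivalent SELECT to confirm the optimization: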
EXPLAIN SELECT * FROM table1 AS t1 JOIN table1 AS t2
ON t1.id > t2.id AND t1.field_name = t2.field_name;
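If you want to try an index, one possible sketch follows. The index name idx_field_name_id is hypothetical, and putting field_name first is my assumption, on the reasoning that it is the equality column in the join. Compare the EXPLAIN output before and after to see whether the optimizer actually uses it.

```sql
-- Hypothetical index name; the column order is an assumption to verify with EXPLAIN.
ALTER TABLE table1 ADD INDEX idx_field_name_id (field_name, id);

EXPLAIN SELECT * FROM table1 AS t1 JOIN table1 AS t2
ON t1.id > t2.id AND t1.field_name = t2.field_name;

-- Drop it again if the plan does not improve.
ALTER TABLE table1 DROP INDEX idx_field_name_id;
```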
You also asked about testing. I'd recommend copying a sample of rows containing duplicates to a table in your test database:
CREATE TABLE test.table1test SELECT * FROM realdb.table1 LIMIT 10000;
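A plain LIMIT sample may happen to contain few duplicates. As a variation, here is a sketch that copies only rows whose field_name occurs more than once. The table name table1dups and the limit of 1000 values are arbitrary choices of mine, and ENGINE=InnoDB is included on the assumption that you want ROLLBACK to work in your experiments, since non-transactional engines such as MyISAM cannot roll back.

```sql
-- Hypothetical scratch table holding only rows that have duplicates.
CREATE TABLE test.table1dups ENGINE=InnoDB
SELECT t.*
FROM realdb.table1 AS t
JOIN (
    -- Pick up to 1000 values that occur more than once.
    SELECT field_name
    FROM realdb.table1
    GROUP BY field_name
    HAVING COUNT(*) > 1
    LIMIT 1000
) AS d ON t.field_name = d.field_name;
```

Note that CREATE TABLE ... SELECT copies the data but not the indexes of the source table, so add any index you want to test separately.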
Now you can perform experiments on your sample data until you're satisfied the DELETE solution is correct.
USE test;
SET autocommit = 0;
DELETE ...
ROLLBACK;
I'd recommend naming your scratch table in the test database something distinct from your real table in your real database, just in case you run an experimental DELETE while you are accidentally still using your real database as the default database!
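Putting the pieces together, one experiment might look like the following sketch. It assumes the scratch table test.table1test created above uses a transactional engine such as InnoDB; otherwise the ROLLBACK has no effect.

```sql
USE test;
SET autocommit = 0;

SELECT COUNT(*) FROM table1test;          -- row count before the experiment

DELETE t1 FROM table1test AS t1 JOIN table1test AS t2
ON t1.id > t2.id AND t1.field_name = t2.field_name;

-- Should return no rows if every duplicate was removed.
SELECT field_name, COUNT(*) AS copies
FROM table1test
GROUP BY field_name
HAVING COUNT(*) > 1;

ROLLBACK;

SELECT COUNT(*) FROM table1test;          -- should match the first count again
```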
Re your comments:
USE test is a mysql client builtin command. It sets the test database as the default database. This will be the default database when you name tables in your queries without qualifying them with a database name. See http://dev.mysql.com/doc/refman/5.1/en/use.html
SET autocommit = 0 turns off the default behavior of implicitly committing a transaction after each query, so you must explicitly issue a COMMIT or ROLLBACK command to finish a transaction. See http://dev.mysql.com/doc/refman/5.1/en/commit.html
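It's worthwhile to use ROLLBACK when you're experimenting because it discards the changes made in that transaction. It's a quick way to return to the initial state of your test data so you can try another experiment. Note that ROLLBACK only undoes changes to tables that use a transactional storage engine such as InnoDB; changes to MyISAM tables cannot be rolled back.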
DELETE t1 is not a typo. DELETE deletes rows, not whole tables. t1 is an alias to each row that satisfies the conditions of the statement (although it is possible that the conditions include every row in the table). See description of multi-table delete at http://dev.mysql.com/doc/refman/5.1/en/delete.html
It's sort of like when you run a loop in PHP and use a variable to iterate over the loop: for ($i=0; $i<100; ++$i) ... The variable $i takes on a series of values, and each time through the loop it has a different value.
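One way to see what t1 stands for is to run the same join as a SELECT and look at the pairs it produces. This is only a sketch on the same table1; the column aliases deleted_id and matched_earlier_id are names I made up for illustration.

```sql
-- The join yields pairs of rows (t1, t2) with the same field_name
-- where t1 has the higher id. DELETE t1 removes the t1 side of each pair.
SELECT t1.id AS deleted_id, t2.id AS matched_earlier_id
FROM table1 AS t1 JOIN table1 AS t2
  ON t1.id > t2.id AND t1.field_name = t2.field_name
ORDER BY t1.id, t2.id;
```

The row with the lowest id in each group never appears in the deleted_id column, because no row with the same field_name has a smaller id.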
Here's a demo showing how my solution deletes multiple duplicates. I ran this in my test database and I'm pasting the result directly from my command window:
mysql> create table table1 (id serial primary key, field_name varchar(10));
Query OK, 0 rows affected (0.45 sec)
mysql> insert into table1 (field_name)
values (42), (42), (42), (42), (42), (42);
Query OK, 6 rows affected (0.00 sec)
Records: 6 Duplicates: 0 Warnings: 0
mysql> select * from table1;
+----+------------+
| id | field_name |
+----+------------+
|  1 | 42         |
|  2 | 42         |
|  3 | 42         |
|  4 | 42         |
|  5 | 42         |
|  6 | 42         |
+----+------------+
6 rows in set (0.00 sec)
mysql> delete t1 from table1 t1 join table1 t2
on t1.id > t2.id and t1.field_name = t2.field_name;
Query OK, 5 rows affected (0.00 sec)
mysql> select * from table1;
+----+------------+
| id | field_name |
+----+------------+
|  1 | 42         |
+----+------------+
1 row in set (0.00 sec)
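The same behavior holds when several different values are duplicated. Here is a sketch with a hypothetical table table1demo. It is not pasted from a live session, so only the statements are shown, and it assumes the default auto-increment settings so that the ids run from 1 to 6.

```sql
-- Hypothetical demo table, same shape as table1 above.
CREATE TABLE table1demo (id SERIAL PRIMARY KEY, field_name VARCHAR(10));

-- ids 1..6 are assigned in order: a, b, a, c, b, a
INSERT INTO table1demo (field_name)
VALUES ('a'), ('b'), ('a'), ('c'), ('b'), ('a');

-- Removes ids 3, 5 and 6: each has an earlier row with the same field_name.
DELETE t1 FROM table1demo AS t1 JOIN table1demo AS t2
ON t1.id > t2.id AND t1.field_name = t2.field_name;

-- Remaining rows: (1, 'a'), (2, 'b'), (4, 'c'),
-- the lowest id for each distinct field_name.
SELECT * FROM table1demo ORDER BY id;
```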