views:

1530

answers:

12

I'm a relative novice when it comes to databases. We are using MySQL and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.

The goal is to remove all the rows in table A that have a matching id in table B.

I'm currently doing the following:

DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);

There are approximately 100K rows in table a and about 22K rows in table b. The column 'id' is the PK for both tables.

This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?


EDIT:

Some additional information that might be helpful. Tables A and B have the same structure as I've done the following to create table B:

CREATE TABLE b LIKE a;

Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I'm a relative novice at DB work and still learning. I don't know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.

Also, I'm using INNO DB.


Here is some additional info that might be helpful to you.

Table A has a structure similar to this (I've sanitized this a bit):

DROP TABLE IF EXISTS `frobozz`.`a`;
CREATE TABLE  `frobozz`.`a` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `fk_g` varchar(30) NOT NULL,
  `h` int(10) unsigned default NULL,
  `i` longtext,
  `j` bigint(20) NOT NULL,
  `k` bigint(20) default NULL,
  `l` varchar(45) NOT NULL,
  `m` int(10) unsigned default NULL,
  `n` varchar(20) default NULL,
  `o` bigint(20) NOT NULL,
  `p` tinyint(1) NOT NULL,
  PRIMARY KEY  USING BTREE (`id`),
  KEY `idx_l` (`l`),
  KEY `idx_h` USING BTREE (`h`),
  KEY `idx_m` USING BTREE (`m`),
  KEY `idx_fk_g` USING BTREE (`fk_g`),
  KEY `fk_g_frobozz` (`id`,`fk_g`),
  CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

I suspect that part of the issue is there are a number of indexes for this table. Table B looks similar to table B, though it only contains the columns id and h.

Also, the profiling results are as follows:

starting 0.000018
checking query cache for query 0.000044
checking permissions 0.000005
Opening tables 0.000009
init 0.000019
optimizing 0.000004
executing 0.000043
end 0.000005
end 0.000002
query end 0.000003
freeing items 0.000007
logging slow query 0.000002
cleaning up 0.000002


SOLVED

Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"

The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had a FK back to a.id and after doing some stuff related to that id in the stored procedure, it had the statement

DELETE FROM c WHERE c.id = theId;

I looked into the EXPLAIN statement and rewrote this as

EXPLAIN SELECT * FROM c WHERE c.other_id = 12345;

so I could see what this was doing and it gave me the following info:

id            1
select_type   SIMPLE
table         c
type          ALL
possible_keys NULL
key           NULL
key_len       NULL
ref           NULL
rows          2633
Extra         using where

This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:

id            1
select_type   SIMPLE
table         c
type          ref
possible_keys Index_1
key           Index_1
key_len       8
ref           const
rows          1
Extra

Much better, in fact really great.

I added that Index_1 and my delete times are in line with the times reported by mattkemp. This was a really subtle error on my part due to shoe-horning some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for this other table C, my DELETEs were fast.

Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn't leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up about that one. I'll learn from that mistake. Second, the offending code was the result of a 'get it done quick' mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I've got the opportunity to improve that.

Thanks to all.

Reference: EXPLAIN Statement

+5  A: 

Try this:

DELETE a
FROM a
INNER JOIN b
 on a.id = b.id

Using subqueries tend to be slower then joins as they are run for each record in the outer query.

Chris Pebble
Thanks for the response, Chris. This didn't seem to speed it up, though. It took right at 3 minutes as well.
itsmatt
I would've thought this answer to be an improvement over the others, but MySQL must be optimizing them all into the same statement. I find it hard to believe it "should" take that long though. I don't know much about indexes in MySQL. Does making a column the PK make it indexed as well? If not, create an explicit index on the 'id' columns of both tables.
Ross
@Ross Primary key is essentially the same as a unique index in MySQL, except that it's a special one.
mattkemp
+3  A: 

You're doing your subquery on 'b' for every row in 'a'.

Try:

DELETE FROM a USING a LEFT JOIN b ON a.id = b.id WHERE b.id IS NOT NULL;
Evert
Thanks for the response, Evert. This didn't seem to speed it up, though. It also took right at 3 minutes to complete.
itsmatt
What storage engine are you using?
Evert
I'm using Inno DB.
itsmatt
+2  A: 
DELETE FROM a WHERE id IN (SELECT id FROM b)
chaos
Thanks for the response, chaos. This didn't seem to speed it up, though. It took right at 3 minutes as well. Very odd that all the suggestions too the same time. Hmm...
itsmatt
The only suggestion I have, then, is dropping the keys on the table before you do the query and re-adding them afterward.
chaos
Hmm... that's interesting. I'll have to try that. I'll let you know.
itsmatt
That's a horrible suggestion as any ALTER TABLE statement, like dropping keys, will rebuild the table from scratch (if you ALTER twice, that's rebuilding twice). If your table size is decent, this query will take a long time.Additionally, it will lock any writes to the table until ALTER TABLE finishes.
Artem Russakovskii
+2  A: 

Try this out:

DELETE QUICK A.* FROM A,B WHERE A.ID=B.ID

It is much faster than normal queries.

Refer for Syntax : http://dev.mysql.com/doc/refman/5.0/en/delete.html

Webrsk
Thanks, Webrsk. I'll give that a try. Appreciate the link.
itsmatt
Just a follow-up. I tried this as well and the execution time was essentially the same - 2 minutes, 55 seconds. So this was about 5 seconds faster for the test set.
itsmatt
DELETE QUICK is only useful in MyISAM tables, as the manual suggests. Also, it could lead to wasted indexes unless you OPTIMIZE TABLE, so I would avoid using it, unless you know exactly what you're doing.
Artem Russakovskii
+7  A: 

Your time of three minutes seems really slow. My guess is that the id column is not being indexed properly. If you could provide the exact table definition you're using that would be helpful.

I created a simple python script to produce test data and ran multiple different versions of the delete query against the same data set. Here's my table definitions:

drop table if exists a;
create table a
 (id bigint unsigned  not null primary key,
  data varchar(255) not null) engine=InnoDB;

drop table if exists b;
create table b like a;

I then inserted 100k rows into a and 25k rows into b (22.5k of which were also in a). Here's the results of the various delete commands. I dropped and repopulated the table between runs by the way.

mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (1.14 sec)

mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (0.81 sec)

mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (0.97 sec)

mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (0.81 sec)

All the tests were run on an Intel Core2 quad-core 2.5GHz, 2GB RAM with Ubuntu 8.10 and MySQL 5.0. Note, that the execution of one sql statement is still single threaded.


Update:

I updated my tests to use itsmatt's schema. I slightly modified it by remove auto increment (I'm generating synthetic data) and character set encoding (wasn't working - didn't dig into it).

Here's my new table definitions:

drop table if exists a;
drop table if exists b;
drop table if exists c;

create table c (id varchar(30) not null primary key) engine=InnoDB;

create table a (
  id bigint(20) unsigned not null primary key,
  c_id varchar(30) not null,
  h int(10) unsigned default null,
  i longtext,
  j bigint(20) not null,
  k bigint(20) default null,
  l varchar(45) not null,
  m int(10) unsigned default null,
  n varchar(20) default null,
  o bigint(20) not null,
  p tinyint(1) not null,
  key l_idx (l),
  key h_idx (h),
  key m_idx (m),
  key c_id_idx (id, c_id),
  key c_id_fk (c_id),
  constraint c_id_fk foreign key (c_id) references c(id)
) engine=InnoDB row_format=dynamic;

create table b like a;

I then reran the same tests with 100k rows in a and 25k rows in b (and repopulating between runs).

mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (11.90 sec)

mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (11.48 sec)

mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (12.21 sec)

mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (12.33 sec)

As you can see this is quite a bit slower than before, probably due to the multiple indexes. However, it is nowhere near the three minute mark.

Something else that you might want to look at is moving the longtext field to the end of the schema. I seem to remember that mySQL performs better if all the size restricted fields are first and text, blob, etc are at the end.

mattkemp
The author already said that ID was the primary key. That is indexed by default - in InnoDB it's a clustered index.
Daniel Schneller
+2  A: 

Maybe you should rebuild the indicies before running such a hugh query. Well, you should rebuild them periodically.

REPAIR TABLE a QUICK;
REPAIR TABLE b QUICK;

and then run any of the above queries (i.e.)

DELETE FROM a WHERE id IN (SELECT id FROM b)
Scoregraphic
I get 'The storage engine for the table doesn't support repair" as the Msg_text when I attempt to do that for either table. I guess this isn't supported for Inno DB. I'll have to check that though.
itsmatt
You may try "ANALYZE TABLE a" and/or "OPTIMIZE TABLE a" then. This should update the key distribution.
Scoregraphic
+2  A: 

The query itself is already in an optimal form, updating the indexes causes the whole operation to take that long. You could disable the keys on that table before the operation, that should speed things up. You can turn them back on at a later time, if you don't need them immediately.

Another approach would be adding a deleted flag-column to your table and adjusting other queries so they take that value into account. The fastest boolean type in mysql is CHAR(0) NULL (true = '', false = NULL). That would be a fast operation, you can delete the values afterwards.

The same thoughts expressed in sql statements:

ALTER TABLE a ADD COLUMN deleted CHAR(0) NULL DEFAULT NULL;

-- The following query should be faster than the delete statement:
UPDATE a INNER JOIN b SET a.deleted = '';

-- This is the catch, you need to alter the rest
-- of your queries to take the new column into account:
SELECT * FROM a WHERE deleted IS NULL;

-- You can then issue the following queries in a cronjob
-- to clean up the tables:
DELETE FROM a WHERE deleted IS NOT NULL;

If that, too, is not what you want, you can have a look at what the mysql docs have to say about the speed of delete statements.

soulmerge
+1  A: 

Obviously the SELECT query that builds the foundation of your DELETE operation is quite fast so I'd think that either the foreign key constraint or the indexes are the reasons for your extremely slow query.

Try

SET foreign_key_checks = 0;
/* ... your query ... */
SET foreign_key_checks = 1;

This would disable the checks on the foreign key. Unfortunately you cannot disable (at least I don't know how) the key-updates with an InnoDB table. With a MyISAM table you could do something like

ALTER TABLE a DISABLE KEYS
/* ... your query ... */
ALTER TABLE a ENABLE KEYS

I actually did not test if these settings would affect the query duration. But it's worth a try.

Stefan Gehrig
Disabling foreign key checks is almost always bad advice. They are in place for a reason and disabling them for performance a) seldom produces a noticable speed improvement and more importantly b) allows anyone to wreak havoc on your data integrity while "the shields are down".
Daniel Schneller
When restoring a backup you know beeing consistent for example, disabling key checks won't do any harm (and sometimes is the only way to restore a backup). Doing so on a live system will indeed wreak havoc. If it produces speed improvements must be checked by running benchmarks.
Stefan Gehrig
+13  A: 

Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem - most of them will be optimized to the same execution plan anyway.

While it may be hard to understand why DELETEs of all cases are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway-through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE other clients connecting to the server will see the records until your DELETE completes.

To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.

For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.

Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.

So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows. To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record. With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.

Daniel Schneller
Really informative, thanks for taking the time to write so concisely.
Jas Panesar
+1  A: 

This is what I always do, when I have to operate with super large data (here: a sample test table with 150000 rows):

drop table if exists employees_bak;
create table employees_bak like employees;
insert into employees_bak 
    select * from employees
    where emp_no > 100000;

rename table employees to employees_todelete;
rename table employees_bak to employees;

In this case the sql filters 50000 rows into the backup table. The query cascade performs on my slow machine in 5 seconds. You can replace the insert into select by your own filter query.

That is the trick to perform mass deletion on big databases!;=)

Tom Schaefer
Interesting idea, Tom. Thanks for posting it.
itsmatt
Since you're not mentioning transactions, I'm assuming you're not using any. The caveat here is that there could be a race condition if something is going to insert into employees between the time you insert and rename the table. You may lose some employees as a result. Perhaps, in your case, you prevent new employees to be inserted but as a general solution, this is something to be careful about.Also, you can batch the rename table operations to make them atomic. RENAME a to c, b to a, c to b would swap tables a and b for example.
Artem Russakovskii
Usually i lock the tables before performing such an operation. I am aware of transactions and I am still using them. But there some reasons, which I will not consider here, for performing my solution and not transactions.
Tom Schaefer
+2  A: 

I know this question has been pretty much solved due to OP's indexing omissions but I would like to offer this additional advice, which is valid for a more generic case of this problem.

I have personally dealt with having to delete many rows from one table that exist in another and in my experience it's best to do the following, especially if you expect lots of rows to be deleted. This technique most importantly will improve replication slave lag, as the longer each single mutator query runs, the worse the lag would be (replication is single threaded).

So, here it is: do a SELECT first, as a separate query, remembering the IDs returned in your script/application, then continue on deleting in batches (say, 50,000 rows at a time). This will achieve the following:

  • each one of the delete statements will not lock the table for too long, thus not letting replication lag to get out of control. It is especially important if you rely on your replication to provide you relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can adjust it to be smaller without touching any DB structures.
  • another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if it can't for whatever reason use the best DB indexes. If the SELECT is inner to a DELETE, when the whole statement migrates to the slaves, it will have to do the SELECT all over again, potentially lagging the slaves because it has to do the long select all over again. Slave lag, again, suffers badly. If you use a separate SELECT query, this problem goes away, as all you're passing is a list of IDs.

Let me know if there's a fault in my logic somewhere.

For more discussion on replication lag and ways to fight it, similar to this one, see MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It

P.S. One thing to be careful about is, of course, potential edits to the table between the times the SELECT finishes and DELETEs start. I will let you handle such details by using transactions and/or logic pertinent to your application.

Artem Russakovskii
+2  A: 

BTW, after posting the above on my blog, Baron Schwartz from Percona brought to my attention that his maatkit already has a tool just for this purpose - mk-archiver. http://www.maatkit.org/doc/mk-archiver.html.

It is most likely your best tool for the job.

Artem Russakovskii