The following query takes forever to execute (30+ hours on a MacBook with 4 GB of RAM). I'm looking for ways to make it run more efficiently. Any thoughts are appreciated!

CREATE TABLE fc AS 
SELECT  threadid,
    title,
    body,
    date,
    userlogin
FROM f 
WHERE pid 
    NOT IN (SELECT pid FROM ft) ORDER BY date;

(Table "f" is ~1 GB / 1,843,000 rows; table "ft" is 168 MB / 216,000 rows.)

+5  A: 

Try an outer join (I think MySQL supports them now) instead of a not in:

create table fc as 
select f.threadid
     , f.title
     , f.body
     , f.date
     , f.userlogin 
from f 
left outer join ft 
  on f.pid = ft.pid 
where ft.pid is null 
order by date
Adam Ruth
This is equivalent to the original one. MySQL will optimize it.
sza
@ziang - you wish. MySQL has been notorious for not doing so. But you can't know that without testing it. I've had much better luck with this technique.
le dorfier
This is such a simple query with one join...
sza
+1  A: 

Add a clustered index on pid on both the fc and ft tables.

sza
I don't know how to do that. I did a quick Google search and couldn't find much info on how to go about doing it...
HipHop-opatamus
CREATE INDEX idx_pid ON fc(pid); CREATE INDEX idx_pid ON ft(pid);
sza
Well, you don't need an index on fc because that's the table you're creating. You might need an index on f.pid and/or ft.pid based on what your explain plan results tell you (see comment below). Here's how to create an index. http://dev.mysql.com/doc/refman/5.0/en/create-index.html
Ed Lucas
That's not a clustered index statement. And default mysql tables don't support clustered indexes.
le dorfier
Then use one of UNIQUE, FULLTEXT, SPATIAL
sza
In MySQL, the only clustered index is the primary key of an InnoDB table.
newtover
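For reference, a minimal sketch of the plain (non-clustered) indexes discussed in this thread. The index names idx_f_pid and idx_ft_pid are made up for illustration, and the commented-out last statement assumes ft is an InnoDB table whose pid values are unique and never NULL, which the question does not confirm. Skip the ft index if one already exists.

```sql
-- Secondary indexes on the column used to match the two source tables.
-- Index names are hypothetical; pick your own.
CREATE INDEX idx_f_pid  ON f  (pid);
CREATE INDEX idx_ft_pid ON ft (pid);

-- Check which indexes exist afterwards.
SHOW INDEX FROM f;
SHOW INDEX FROM ft;

-- Only if ft is InnoDB and pid is unique and never NULL (an assumption):
-- making pid the primary key also makes it the clustered index.
-- ALTER TABLE ft ADD PRIMARY KEY (pid);
```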
+2  A: 

Start with EXPLAIN PLAN to see what the optimizer says. Then re-run it when you make changes to see if they help.

I'll bet the right query will run in minutes.
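In MySQL the statement is plain EXPLAIN placed in front of the SELECT part (EXPLAIN PLAN is the Oracle spelling). A sketch using the query from the question and the outer-join rewrite suggested above, so the two plans can be compared side by side:

```sql
-- Plan for the original NOT IN form (everything after CREATE TABLE ... AS).
EXPLAIN
SELECT threadid, title, body, date, userlogin
FROM f
WHERE pid NOT IN (SELECT pid FROM ft)
ORDER BY date;

-- Plan for the LEFT OUTER JOIN form.
EXPLAIN
SELECT f.threadid, f.title, f.body, f.date, f.userlogin
FROM f
LEFT OUTER JOIN ft ON f.pid = ft.pid
WHERE ft.pid IS NULL
ORDER BY f.date;
```

In the output, the type, key, rows and Extra columns are the ones to look at first: type = ALL means a full table scan, and key shows which index, if any, was picked.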

duffymo
I have never used the EXPLAIN PLAN command either - according to this, it needs to be run on a select statement? http://dev.mysql.com/doc/refman/5.0/en/explain.html
HipHop-opatamus
Explain plan simply gives you a glimpse into how the DB parses and executes the SQL query. You're looking for index usage and not table scans. So...compare the explain plan results of your original statement (everything from "select..." to the end) with Adam's statement (from "select" to the end), and see if anything major pops out at you as being really good or bad.
Ed Lucas
+1 for studying the query plan analyzer to understand what is happening.
Pascal Thivent
@Ed Lucas - are you saying that EXPLAIN PLAN output won't change before and after adding an index? I think a table scan might be a good indication of why a query is running for a long time, especially when the number of rows increases.
duffymo
@HipHop - yes, of course, so run it on everything after the CREATE TABLE.
duffymo
Great, thanks - just ran it; the output of EXPLAIN PLAN can be found here: http://pastebin.com/cbZQu02V . The biggest difference I can see is under Extra: for Adam's code it lists "Using temporary; Using filesort", while for the original it lists "Using where; Using filesort". I assume using temporary is faster?
HipHop-opatamus
A: 

Make sure you have a pid index on ft. It sounds like you are getting the full cross product instead of a join by index.

Keith Randall
Yup - there is an index on ft (created with CREATE INDEX idx_pid ON ft(pid))
HipHop-opatamus
A: 

There can be some hidden costs. How long does it take to run this:

SELECT  count(*)
FROM f 
WHERE pid 
    NOT IN (SELECT pid FROM ft);

If it doesn't take long, then your command's slowness may be MySQL duplicating all the data as the statement executes just in case it fails and has to roll it back. (I've seen this with SQL Server.)

Also: is it any different if you take out the ORDER BY clause?

egrunin
A: 

How many rows in f won't match a row in ft? In the most extreme case, if pid is unique in f, your target table fc will contain more than 1.6 million rows. If the bulk of the rows will end up in fc, you would be better off doing this in two stages:

CREATE TABLE fc AS 
SELECT  pid,  -- kept so the DELETE below can filter on it
    threadid,
    title,
    body,
    date,
    userlogin
FROM f
ORDER BY date;

DELETE FROM fc
WHERE pid 
     IN (SELECT pid FROM ft);

Incidentally, can you ditch the ORDER BY clause? That sort could cost a lot of cycles, again depending on how many rows there are in the target table.
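A sketch of what dropping the sort could look like: build fc unsorted and sort only when reading. It assumes the rows are later read with an explicit ORDER BY (row order in a table is not guaranteed without one anyway); the index name idx_fc_date and the LIMIT in the sample read are hypothetical.

```sql
-- Build the table without paying for a sort of the full result.
CREATE TABLE fc AS
SELECT threadid, title, body, date, userlogin
FROM f
WHERE pid NOT IN (SELECT pid FROM ft);

-- Optional: an index on date so later sorted reads can use it.
CREATE INDEX idx_fc_date ON fc (date);

-- Sort at read time instead.
SELECT threadid, title, date
FROM fc
ORDER BY date
LIMIT 20;
```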

Another thing to consider is the EXISTS clause...

CREATE TABLE fc AS 
SELECT  threadid,
    title,
    body,
    date,
    userlogin
FROM f 
WHERE NOT EXISTS 
    (SELECT pid FROM ft 
     WHERE ft.pid = f.pid)
ORDER BY date;

... or in my two-step version ...

DELETE FROM fc
WHERE EXISTS
     (SELECT pid FROM ft 
      WHERE ft.pid = fc.pid);

EXISTS can be a lot faster than IN when the sub-query generates a lot of rows. However, as is always the case with tuning, benchmarking is key.
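One cheap way to benchmark is to time the three filter styles as counts before committing to a full CREATE TABLE run. This is only a sketch; it assumes pid is the matching column in both tables, as in the question. The mysql command-line client prints the elapsed time after each statement.

```sql
-- NOT IN form, as in the question.
SELECT COUNT(*)
FROM f
WHERE pid NOT IN (SELECT pid FROM ft);

-- NOT EXISTS form.
SELECT COUNT(*)
FROM f
WHERE NOT EXISTS (SELECT 1 FROM ft WHERE ft.pid = f.pid);

-- LEFT OUTER JOIN ... IS NULL form.
SELECT COUNT(*)
FROM f
LEFT OUTER JOIN ft ON f.pid = ft.pid
WHERE ft.pid IS NULL;
```

The three counts should agree as long as pid is never NULL in either table. NOT IN treats NULLs differently from the other two forms, so a mismatch is worth investigating before trusting any timing.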

APC