ansaurus

Question

Answer 1

+5 A:

Try an outer join (I think MySQL supports them now) instead of NOT IN:

create table fc as 
select f.threadid
     , f.title
     , f.body
     , f.date
     , f.userlogin 
from f 
left outer join ft 
  on f.pid = ft.pid 
where ft.pid is null 
order by date

Adam Ruth 2010-04-13 00:49:52

This is equivalent to the original query. MySQL will optimize it.

sza 2010-04-13 00:55:17

@ziang - you wish. MySQL has been notorious for not doing that. But you can't know without testing it. I've had much better luck with this technique.

le dorfier 2010-04-13 02:07:33

this is such a simple query with one join...

sza 2010-04-13 02:14:19

Answer 2

+1 A:

Add a clustered index on pid on both fc and ft tables.

sza 2010-04-13 00:50:25

I don't know how to do that? I did a quick Google search and couldn't find much info on how to go about doing it...

HipHop-opatamus 2010-04-13 01:12:41

CREATE INDEX idx_pid ON fc(pid);
CREATE INDEX idx_pid ON ft(pid);

sza 2010-04-13 02:05:26

Well, you don't need an index on fc because that's the table you're creating. You might need an index on f.pid and/or ft.pid based on what your explain plan results tell you (see comment below). Here's how to create an index. http://dev.mysql.com/doc/refman/5.0/en/create-index.html

Ed Lucas 2010-04-13 02:06:58

That's not a clustered index statement. And default MySQL tables don't support clustered indexes.

le dorfier 2010-04-13 02:09:36

Then use one of UNIQUE, FULLTEXT, SPATIAL

sza 2010-04-13 02:12:38

In MySQL, the only clustered index is the primary key of an InnoDB table.

newtover 2010-04-13 06:59:36

Answer 3

+2 A:

Start with EXPLAIN PLAN to see what the optimizer says. Then re-run it when you make changes to see if they help.

I'll bet the right query will run in minutes.
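
As a rough sketch of that workflow: in MySQL the statement is spelled EXPLAIN (EXPLAIN PLAN is Oracle's spelling), and it goes in front of the SELECT part only, not the CREATE TABLE. The NOT IN form below is an assumption about the original query, reconstructed from the other answers; the table and column names are the ones used in this thread.

```sql
-- Assumed shape of the original query (NOT IN), without the CREATE TABLE prefix.
EXPLAIN
SELECT f.threadid, f.title, f.body, f.date, f.userlogin
FROM f
WHERE f.pid NOT IN (SELECT pid FROM ft)
ORDER BY f.date;

-- The outer-join rewrite from Answer 1, for comparison.
EXPLAIN
SELECT f.threadid, f.title, f.body, f.date, f.userlogin
FROM f
LEFT OUTER JOIN ft ON f.pid = ft.pid
WHERE ft.pid IS NULL
ORDER BY f.date;
```

In the output, type = ALL means a full scan of that table, and NULL in the key column means no index was chosen. Re-run both statements after each index change and compare.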

duffymo 2010-04-13 01:11:17

I have never used the EXPLAIN PLAN command either - according to this, it needs to be run on a select statement? http://dev.mysql.com/doc/refman/5.0/en/explain.html

HipHop-opatamus 2010-04-13 01:15:30

Explain plan simply gives you a glimpse into how the DB parses and executes the SQL query. You're looking for index usage and not table scans. So...compare the explain plan results of your original statement (everything from "select..." to the end) with Adam's statement (from "select" to the end), and see if anything major pops out at you as being really good or bad.

Ed Lucas 2010-04-13 02:01:51

+1 for studying the query plan analyzer to understand what is happening.

Pascal Thivent 2010-04-13 02:02:17

@Ed Lucas - are you saying that EXPLAIN PLAN output won't change before and after adding an index? I think a table scan might be a good indication of why a query is running for a long time, especially when the number of rows increases.

duffymo 2010-04-13 02:11:55

@HipHop - yes, of course, so run it on everything after the CREATE TABLE.

duffymo 2010-04-13 02:13:03

Great, thanks - just ran it; the output of EXPLAIN PLAN can be found here: http://pastebin.com/cbZQu02V. The biggest difference I can see is under Extra: for Adam's code it lists "Using temporary; Using filesort", while for the original it lists "Using where; Using filesort". I assume using temporary is faster?

HipHop-opatamus 2010-04-13 02:15:44

Answer 4

A:

Make sure you have a pid index on ft. It sounds like you are getting the full cross product instead of a join by index.
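
A quick way to check this, sketched with the table names used elsewhere in this thread (idx_pid is simply the index name chosen in the comments, and the EXPLAIN query is a cut-down version of the join from Answer 1):

```sql
-- List the indexes that already exist on ft; look for a row whose Column_name is pid.
SHOW INDEX FROM ft;

-- If none is listed, create one on the join column.
CREATE INDEX idx_pid ON ft (pid);

-- Check whether the optimizer uses it: ideally the row for ft shows idx_pid under "key".
EXPLAIN
SELECT f.pid
FROM f
LEFT OUTER JOIN ft ON f.pid = ft.pid
WHERE ft.pid IS NULL;
```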

Keith Randall 2010-04-13 01:58:02

Yup - there is an index on ft (created with CREATE INDEX idx_pid ON ft(pid))

HipHop-opatamus 2010-04-13 02:31:47

Answer 5

A:

There can be some hidden costs. How long does it take to run this:

SELECT  count(*)
FROM f 
WHERE pid 
    NOT IN (SELECT pid FROM ft);

If it doesn't take long, then your command's slowness may be MySQL duplicating all the data as the statement executes just in case it fails and has to roll it back. (I've seen this with SQL Server.)

Also: is it any different if you take out the ORDER BY clause?
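
To separate the sort cost from the rest, one option is to build the table without ORDER BY and sort only when reading. This is a sketch that assumes the original statement used the NOT IN form shown above; the index name idx_fc_date is made up for illustration.

```sql
-- Build fc without sorting; the row order of a table is not guaranteed anyway.
CREATE TABLE fc AS
SELECT threadid, title, body, date, userlogin
FROM f
WHERE pid NOT IN (SELECT pid FROM ft);

-- Optional: index the sort column (idx_fc_date is a hypothetical name).
CREATE INDEX idx_fc_date ON fc (date);

-- Apply the ordering at query time instead.
SELECT threadid, title, body, date, userlogin
FROM fc
ORDER BY date;
```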

egrunin 2010-04-13 04:25:55

Answer 6

A:

How many rows in f won't match a row in ft? In the most extreme case, if pid is unique in f your target table fc will contain >1.6m rows. If the bulk of the rows will end up in fc you would be better off doing this in two stages:

CREATE TABLE fc AS 
SELECT  pid,    -- kept so the DELETE below can filter on it
    threadid,
    title,
    body,
    date,
    userlogin
FROM f
ORDER BY date;

DELETE FROM fc
WHERE pid 
     IN (SELECT pid FROM ft);

Incidentally, can you ditch the ORDER BY clause? That sort could cost a lot of cycles, again depending on how many rows there are in the target table.

Another thing to consider is the EXISTS clause...

CREATE TABLE fc AS 
SELECT  threadid,
    title,
    body,
    date,
    userlogin
FROM f 
WHERE NOT EXISTS 
    (SELECT pid FROM ft 
     WHERE ft.pid = f.pid)
ORDER BY date;

... or in my two-step version ...

DELETE FROM fc
WHERE EXISTS
     (SELECT pid FROM ft 
      WHERE ft.pid = fc.pid);

EXISTS can be a lot faster than IN when the sub-query generates a lot of rows. However, as is always the case with tuning, benchmarking is key.
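
One way to benchmark the three formulations without creating a table each time is to time a COUNT(*) over each, as in the sketch below. Column names follow the rest of this thread, and the match on pid is assumed. Note that NOT IN behaves differently from the other two if ft.pid can be NULL: a single NULL in the subquery makes NOT IN return no rows at all. Assuming pid is never NULL in either table, all three should return the same count.

```sql
-- 1. NOT IN
SELECT COUNT(*) FROM f
WHERE f.pid NOT IN (SELECT pid FROM ft);

-- 2. NOT EXISTS (correlated subquery)
SELECT COUNT(*) FROM f
WHERE NOT EXISTS (SELECT 1 FROM ft WHERE ft.pid = f.pid);

-- 3. LEFT OUTER JOIN ... IS NULL
SELECT COUNT(*) FROM f
LEFT OUTER JOIN ft ON f.pid = ft.pid
WHERE ft.pid IS NULL;
```

The mysql command-line client prints the elapsed time after each statement. Run each variant more than once so that caching does not skew the comparison.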

APC 2010-04-13 06:06:36

Optimize this MySQL query?
