views:

156

answers:

2

This question is really just out of curiosity. Can someone shed some light onto why this query:

insert into c (
    select a.* from a,b
    where a.x=8 and y=1 and b.z=a.z and active > '2010-01-07 00:00:00'
    group by a.z
)

is a million times faster (or close enough) than the query below whenever there's quite a few records involved?

insert into c (
    select * from a
    where x=8 and y=1 and z in (
        select z from b where x=8 and active > '2010-01-07 00:00:00' group by z
    )
)

What I mean is that a query joining a table is so much faster than using IN, but why is this?

+2  A: 

Becuase the subquery is being performed once for every row in the enclosing query, and the join is only performed once.

Databases are optimised for set based queries, so most of the time joins will be faster than subqueries.

You are certainly not the only one to experience this (here is one example). Looks like the query optimizer for MySql doesn't optimise subqueries of this type.

Oded
When joining tables does it not do the same thing?
Tim
When doing a join you get the corresponding rows from both tables. When using a subquery, the second table can be fully re-queried for each row.
Oded
Tim, you're right that they're equivalent. And some other database engines can perform subquery-to-join transformations on queries when the query is compiled/prepared. That would change your 2nd query into the 1st. I know DB2 does this, so probably Oracle, MS SQL Server, etc. PostgreSQL might too.
Harold L
I can confirm that SQL Server does this.
Quick Joe Smith
It is a classic; see MySQL Bug #9090: http://bugs.mysql.com/bug.php?id=9090. Problem is that some light bulb has decided that it is the correct behaviour.
Jacco
+1  A: 

In response to Tim's comment, no joins are very different.

Let us say that table a has 10,000 rows, and table b has 1,000 rows. The subquery approach runs the inner query for every row of the outer query. That results in 10,000 x 1,000 = 10,000,000 rows to process.

Depending on the fields involved in the join (and whether they are indexed), the join may only have to pass over each table once, resulting in ~11,000 rows processed. It may even only have to read each row of the smaller table and discard the remainder of the larger table resulting in even fewer reads.

It all depends on the join algorithm and the database engine.

Quick Joe Smith
So if I want to delete records from a table based on query in another table, is there a faster way than doing "delete from table1 where x in (select y from table2)"
Tim
I can't say for sure in MySQL. The only alternative I can think of is (for complex conditions) to insert the delete candidates into a temporary table and do a "delete from table1 where x in (select x from tmptbl)".
Quick Joe Smith
@Tim: Yes mysql has MULTI DELETE support with basic join capabilities, ex: delete f.* from foo f, bar b where f.id=b.id;
ggiroux