I have a query that looks like this:

select
id
, int1
, int2
, (select count(*) from big_table_with_millions_of_rows 
    where id between t.int1 and t.int2)
from myTable t
where
....

This select returns exactly one row. The id used in the inline select is an indexed column (the primary key). If I replace t.int1 and t.int2 with the values of int1/int2 returned by this single row, the query completes in milliseconds. If I execute the query as above, i.e. with references to int1/int2, it takes about 10 minutes. When I run the profiler and look at what actually happens, I see that 99% of the time the engine is busy returning data from the inline query. It looks as though MySQL is actually running the

select ... from big_table_with_millions_of_rows

bit of the inline query once before applying the

where id between t.int1 and t.int2

bit to the result. Can this be true? If not, then what is going on? I had always thought that inline SELECTs were potentially hazardous because they are executed row-by-row as the last element of the query, but that in situations like this, where the initial SELECT is highly selective, they can be very efficient. Can anyone shed any light on this?

EDIT: thanks for the feedback so far. My concern is not so much the row-by-row nature of the inline query as the fact that it seems unable to use the primary key index when given column references rather than (the same) hardcoded values. My guess would be that if ANALYZE has not been run recently, the optimizer assumes it has to do a table scan because it has no knowledge of the data distribution. But shouldn't the fact that the range lookup is done on the primary key compensate for that?
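
One way to check the guess above is to ask MySQL for its execution plan. The following is only a sketch: the outer WHERE clause is elided in the question and is left out here, and the alias row_count is added purely for illustration.

```sql
-- Sketch only: the outer WHERE clause from the question is omitted,
-- and row_count is a hypothetical alias.
EXPLAIN
select
  t.id
, t.int1
, t.int2
, (select count(*) from big_table_with_millions_of_rows b
    where b.id between t.int1 and t.int2) as row_count
from myTable t;

-- Refresh the index statistics, then run the EXPLAIN again to see
-- whether the plan changes.
ANALYZE TABLE big_table_with_millions_of_rows;
```

In the EXPLAIN output, the row for the inline select should have a select_type of DEPENDENT SUBQUERY. If its type column reads ALL or index rather than range, the engine is scanning the whole table or the whole index instead of doing a range lookup on the primary key.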

A: 

If a subquery references fields from its containing query, the subquery has to be rerun for every row in the containing query, because the referenced fields may be different in each row. If it's completely self-contained, it can be run a single time before the outer query begins processing.
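
As a rough illustration of the difference, using the table names from the question (the alias row_count and the literal bounds 100 and 200 are placeholders, not values from the original):

```sql
-- Correlated: the subquery references t.int1 and t.int2 from the outer
-- query, so it has to be evaluated again for each row of myTable.
select
  t.id
, (select count(*) from big_table_with_millions_of_rows b
    where b.id between t.int1 and t.int2) as row_count
from myTable t;

-- Self-contained: the subquery references nothing from the outer query,
-- so its result is the same for every row and can be computed once.
-- The bounds 100 and 200 are placeholder values.
select
  t.id
, (select count(*) from big_table_with_millions_of_rows b
    where b.id between 100 and 200) as row_count
from myTable t;
```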

Dewayne Christensen
+1  A: 

Try to avoid correlated subqueries by using JOIN if you can.

Watch this great video on MySQL performance on YouTube. Skip to the 31:00 mark, where the speaker, Jay Pipes, talks about avoiding correlated subqueries.

Yada
interesting link - thank you!
davek
+1  A: 

If the correlated subquery isn't optimized well, then try this query:

select
  t.id
, t.int1
, t.int2
, count(b.id)  -- count(b.id), not count(*), so a row with no matches yields 0 rather than 1
from myTable t
left outer join big_table_with_millions_of_rows b
  on (b.id between t.int1 and t.int2)
where
....
group by t.id

That should optimize much better.


Re your updated question: Right, MySQL is not the most sophisticated RDBMS on the market in terms of optimization. Don't be surprised when MySQL can't optimize corner cases like this.

I'm a fan of MySQL for its ease of use and open source and all those good things, but the truth is that its competitors are far ahead of MySQL in terms of technology. Every RDBMS has some "blind spots" but MySQL's seem to be larger.

Also be sure you're using the latest version of MySQL. They improve the optimizer in every release, so you might get better results with a newer version.

Bill Karwin
+1 thank you: that brought the execution time down from minutes to several seconds. I'll have to bear that tip in mind in future!
davek