We have a simple query that looks like:

SELECT a,b,c,d FROM table WHERE a=1 and b IN ('aaa', 'bbb', 'ccc', ...)

No joins at all, and 5000 constant values in the IN clause.

Now, this query takes 1-20 seconds to run on a very strong (16 core) server. The table has an index on (a,b), and we also tried reversing the index to (b,a). The server has tons of memory, and nobody is writing to this table - just 5 processes running selects like I described above.

We did some profiling and saw that some queries spend 3.5 seconds in "JOIN::optimize" (.\sql_select.cc 977). I remind you, the queries do not use joins at all.

What could be the cause for this large time spent optimizing joins on a join-less table?

Here is the result of EXPLAIN SELECT:

id  select_type  table  type   possible_keys  key     key_len  ref  rows  Extra
1   SIMPLE       table  range  IX_A_B         IX_A_B  65       \N   5000  Using where
A: 

Do you have indexes on field a and especially b?

If you are asking for help optimizing SQL, you should attach the output of

EXPLAIN SELECT a,b,c,d FROM table WHERE a=1 and b IN ('aaa', 'bbb', 'ccc', ...)

as well; without it, people can only guess.

bluszcz
Added explain result.
ripper234
+4  A: 

Try putting 5000 values in a temporary table:

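-- T-SQL table-variable syntax; in MySQL use CREATE TEMPORARY TABLE instead
declare @t table (b varchar(10))

-- load the lookup values into the table variable
insert into @t select 'aaa'
union all select 'bbb'
union all select 'c'
....

-- join against the lookup values instead of using a long IN list
select table.*
from table
join @t t on table.b = t.b
where table.a = 1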
Andomar
Why would that be better?
ripper234
Two reasons it could be better; definitely worth a try. 1) Every time there is a different number of options for b, the DBMS may see it as sufficiently different to recompile (even though it's a parametrised query). This is very likely, because the way OP describes the problem, it sounds likely that the overhead is due to query recompilation. 2) By putting data into a temp table (especially if the column is indexed) you give the optimiser the option to consider various joins... a MERGE JOIN could prove quite efficient.
Craig Young
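A minimal sketch of the temporary-table idea above, driven from Python. It uses the standard-library sqlite3 module with an in-memory database purely as a stand-in for MySQL, so it shows only the shape of the approach and says nothing about MySQL's optimizer timings. The names data_table, wanted_b, select_via_temp_table and make_demo_connection are hypothetical, and INSERT OR IGNORE is SQLite syntax.

```python
import sqlite3


def select_via_temp_table(conn, a_value, b_values):
    """Load b_values into a temporary table, then join against it
    instead of sending one long IN (...) list."""
    cur = conn.cursor()
    # Hypothetical temp table name; the primary key gives the join an index.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS wanted_b (b TEXT PRIMARY KEY)")
    cur.execute("DELETE FROM wanted_b")
    # INSERT OR IGNORE (SQLite syntax) skips duplicate values.
    cur.executemany(
        "INSERT OR IGNORE INTO wanted_b (b) VALUES (?)",
        ((value,) for value in b_values),
    )
    cur.execute(
        "SELECT t.a, t.b, t.c, t.d "
        "FROM data_table AS t "
        "JOIN wanted_b AS w ON t.b = w.b "
        "WHERE t.a = ? "
        "ORDER BY t.b",
        (a_value,),
    )
    return cur.fetchall()


def make_demo_connection():
    """Build a tiny in-memory table shaped like the one in the question."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data_table (a INTEGER, b TEXT, c TEXT, d TEXT)")
    conn.execute("CREATE INDEX ix_a_b ON data_table (a, b)")
    conn.executemany(
        "INSERT INTO data_table VALUES (?, ?, ?, ?)",
        [
            (1, "aaa", "c1", "d1"),
            (1, "bbb", "c2", "d2"),
            (1, "zzz", "c3", "d3"),
            (2, "aaa", "c4", "d4"),
        ],
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_connection()
    for row in select_via_temp_table(demo, 1, ["aaa", "bbb", "ccc"]):
        print(row)
```

The SELECT text stays identical no matter how many values are loaded, which is the property the comment above relies on; whether that actually shortens optimization on a given MySQL version would need to be measured.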
+1  A: 

b IN(x,y,...) gets translated into: (b = x OR b = y OR b = ...)

this means you have 5000 if-checks to do for each value in the table.

Viktor Klang
Maybe the server recognises this, and instead constructs a temporary table of 'aaa', 'bbb' etc and does a JOIN to that?
vincebowdren
+1  A: 

Using an IN clause like that may as well be a join, so it's not completely join-less.

It's fairly good that you have an index on (a,b), but you have to wonder how it's going to get at the values c and d... in the end, it'll probably be ignoring the index and just scanning the whole table.

Try making an index on (a,b,c,d), so that the index contains all the data you need. In SQL Server you'd do this with included columns, but I think in mysql you'd need to just include the others too. This should mean that your query can go straight to the a=1 records, and start looking through for records of b that match the list, and then it has all the information it needs.

Rob Farley
Why would it scan the entire table? The table has 25 million records - it's cheaper to find the correct records through the index and fetch them. Indeed, the explain shows that 5000 records were scanned. I am selecting the entire record (equivalent to *), so I can't build an index that contains all the queried fields. Do you think using WHERE a=1 and (b='aaa' OR b='bbb' OR b='ccc') would be better?
ripper234
It's scanning, and the scan is returning 5000 records, after looking through all of them. Most database systems don't cope very well with OR or IN. And an index on a,b means if it used that index, it could find 5000 records (seeking to the a=1 and then finding the records in there), but then it needs to fetch c,d from the table itself, which also costs. The system may well feel that this extra lookup is too much and decide to scan instead.
Rob Farley
It scans because the disk seeking required is considered very expensive. The DB may be able to find a,b quickly; but then for each a,b it needs to 'bookmark-lookup' c,d as follows: 1) Read the clustered position key 2) Read a clustered index page starting at root (disk seek) 3) Determine the next index page 4) If not at data page, goto 2. Disk seek is considered **extremely expensive** when compared to sequentially loading all data into memory and performing a linear search.
Craig Young
A: 

Our DBA found this as a reported bug.

ripper234
Yes, it's ***reported*** as a bug; but consider this: you're effectively asking the optimiser to evaluate a query with a 5000-way OR condition! From the link you provided: **[17 Aug 2007 20:43] Igor Babaev** said: "This is not a bug: for some queries ANY DBMS can be forced to spend for the optimizer phase much more than for the execution phase." Also: "Any query that requires long range optimization can be always rewritten to avoid this problem (e.g.: the predicate a IN <long list> can be replaced for a+0 IN <long list>)."
Craig Young
I refer you AGAIN to the following answer: http://stackoverflow.com/questions/2153799/mysql-takes-a-long-time-optimizing-a-join-less-query/2153819#2153819
Craig Young
The query is simple - there are 5000 constant values in an IN list. It should not take a long time to optimize. I consulted said DBA, and he can't make much sense of that answer. It's possible it will work, but before that I will simply try smaller batch sizes.
ripper234
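A rough sketch of the "smaller batch sizes" idea from the comment above: split the 5000 values into chunks and issue one parameterised query per chunk. The helper names batched and build_batched_queries are hypothetical, the default batch size of 500 is an arbitrary assumption rather than a tuned value, and the %s placeholder style assumes a MySQL driver that uses the "format" paramstyle (such as PyMySQL or mysqlclient).

```python
from itertools import islice


def batched(values, batch_size):
    """Yield consecutive lists of at most batch_size items from values."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(values)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def build_batched_queries(a_value, b_values, batch_size=500):
    """Yield (sql, params) pairs, one per batch of b_values.

    Each statement has the same shape as the query in the question,
    but with a shorter IN list.
    """
    for batch in batched(b_values, batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = (
            "SELECT a, b, c, d FROM `table` "
            f"WHERE a = %s AND b IN ({placeholders})"
        )
        yield sql, [a_value, *batch]


if __name__ == "__main__":
    values = [f"value{i}" for i in range(1200)]
    for sql, params in build_batched_queries(1, values):
        # With a real connection this would be cursor.execute(sql, params).
        print(len(params) - 1, "values in this batch")
```

The caller concatenates the rows returned by each batch. Whether several short IN lists optimize faster in total than one long list depends on the server and would need to be measured.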
A: 

Your answer will be to consider the suggestions in both of the following answers:
http://stackoverflow.com/questions/2153799/mysql-takes-a-long-time-optimizing-a-join-less-query/2153819#2153819
http://stackoverflow.com/questions/2153799/mysql-takes-a-long-time-optimizing-a-join-less-query/2153860#2153860

In addition, you mentioned that b is highly selective; so:

I suggest you change the order of the columns in your index to (b, a). If the optimiser can narrow down your results more quickly, it will be keener to use the index. (It's generally a good rule of thumb to put your most selective columns earlier in your indexes; rarely if ever would you want to deviate from that principle.)

Craig Young
As I wrote in the question, I tried reversing the index order and it didn't help.
ripper234
@ripper: And as I wrote in my answer - you have multiple issues to deal with. Furthermore, the guideline to place the most selective column first in your index is still a good general principle to follow; even if you don't notice an improvement because 90% of the time is spent recompiling your query whenever the number of elements in the IN clause changes.
Craig Young