views:

140

answers:

3

I have a query to select from another sub-query select. While the two queries look almost the same the second query (in this sample) runs much slower:

SELECT
   user.id
  ,user.first_name
  -- user.*
  FROM user
  WHERE
    user.id IN (SELECT ref_id 
                  FROM education 
                 WHERE ref_type='user' 
                   AND education.institute_id='58' 
                   AND education.institute_type='1'
                );

This query takes 1.2s Explain on this query results:

 id select_type table   type    possible_keys   key key_len ref rows    Extra
 1  PRIMARY         user   index    first_name  152 141192  Using where; Using index
 2  DEPENDENT SUBQUERY  education   index_subquery  ref_type,ref_id,institute_id,institute_type,ref_type_2  ref_id  4   func    1   Using where

The second query:

SELECT
  -- user.id
  -- user.first_name
  user.*
  FROM user
  WHERE
    user.id IN (SELECT ref_id 
                  FROM education 
                 WHERE ref_type='user' 
                   AND education.institute_id='58' 
                   AND education.institute_type='1'
                );

Takes 45sec to run, with explain:

 id select_type table   type    possible_keys   key key_len ref rows    Extra
 1  PRIMARY user    ALL                 141192  Using where
 2  DEPENDENT SUBQUERY  education   index_subquery  ref_type,ref_id,institute_id,institute_type,ref_type_2  ref_id  4   func    1   Using where

Why is it slower if i query only by index fields? Why both queries scans the full length of the user table? Any ideas how to improve?

Thanks.

+4  A: 

I'm not sure why it chooses to use the index when you select only two columns but not when you select all columns, but it is better to select only the columns you need anyway. Also it might be better to try a JOIN instead of a subquery:

SELECT
    user.id
    user.first_name
FROM user
JOIN education 
ON user.id = education.ref_id 
AND education.ref_type='user' 
AND education.institute_id='58' 
AND education.institute_type='1'
Mark Byers
+2  A: 

I have had several occasions where replacing "WHERE foo in (subquery)" with dumping the results of the subquery into a temporary table and using an inner join seriously improved the performance. (Like, a 6.5 minute query turning into a sub-second query.)

Which, er, is what Mark Byers just said.

Frank Shearar
+1  A: 

This is what I think happens:

The query planner will turn the query into an inner join, which gives the database the freedom to start from either table when filtering out the result.

When you only select a few fields from the user table, the result from both tables are small, so the database can choose which table will filter the other from what's most efficient depending on what indexes can be used.

When you fetch all data from the user table, you are forcing it to use the education table to filter the user table as the intermediate result would be too large the other way around. There is no index that fits for matching that way, so you get a table scan which slows down the query.

(Excuse me if some of the terminology is coloured from SQL Server, that's what I regularly use.)

Guffa
Thanks. That's explains clearly. Can i as what do you suggest is better to do? How can i force it to first select the 'education' table and then use the results over the user table?
aviv
@aviv: No, you can't really force it either way, that would probably result in even worse performance anyway. What you can do is to add an index that it can use with the current query.
Guffa