views:

64

answers:

6

If I run the following queries each one returns quickly (0.01 sec) and gives me my desired result.

SELECT tagId FROM tag WHERE name='programming'

SELECT COUNT(DISTINCT workcode) FROM worktag WHERE tagId=123 OR tagId=124

(assume the two tagId numbers were the results from the first query)

I would like to combine these queries so I only have to run it once:

SELECT COUNT(DISTINCT workcode) FROM worktag WHERE tagId IN (SELECT tagId FROM tag WHERE name='programming')

However this query completes in about 1 min and 20 sec. I have indexes on worktag.workcode, worktag.tagId, tag.tagId, and tag.name.

If I run DESCRIBE on the queries the first two use the indexes and the second one uses the index for the subquery (on the tag table) but doesn't use any indexes on the worktag table.

Does anyone know why this might be?

NOTE: the worktag table has over 18 million records in it.

A: 

Have you tried:

SELECT COUNT(DISTINCT workcode) FROM worktag WHERE tagId IN (123, 124)

?

I'm not a MySQL expert, but it looks to me like you might be looking at a significant failure of the query optimizer.

On the other had, good for MySQL that it optimizes the OR in the second statement. I know databases that will successfully optimize the IN (), but not the OR version of the same logical request.

Larry Lustig
+2  A: 

Why don't you use a join instead of a subquery?

SELECT COUNT(DISTINCT workcode)
FROM worktag
LEFT JOIN tag
  ON worktag.tagId = tag.tagID
WHERE tag.name = 'programming'

P.S.: Seems to be reported as bug.

Zed
My coworker just came over and gave me the exact same solution. According to him, mysql can more easily optimize a JOIN as apposed to a subquery. I don't like the statement as much (subquery looks nicer to me), but it works great. Thanks.
chadgh
The sub-query is more imperative: you explicitly ask for selecting the IDs first, and then select the count based on the previous result. In my opinion this gives less freedom to the optimizer. (Your problem still seems to be a bug though, and not an EBKAC! :)
Zed
+1  A: 

A database admin told me recently, that the syntax WHERE x IN ( ... ) is a pain for the database. A join is almost always better:

SELECT COUNT(DISTINCT wt.workcode) 
  FROM worktag wt, tag t 
 WHERE wt.tagId = t.tagId 
   AND t.name='programming'
tangens
It doesn't have to be, and at least with Oracle and Sql server, it's a myth. The OP has hit some kind of bad behavior in MySQL. Your point might be correct for MySQL, there's NOTHING inherently slow with IN.
erikkallen
You need to tell your database admin that that statement is highly product dependent. Plenty of databases do better with IN than with either JOINs or OR. I can't see any particular reason why IN () would be a pain for a database, it seems to naturally offer plenty of opportunity for hitting an index.
Larry Lustig
+1  A: 
SELECT COUNT(DISTINCT workcode) 
FROM worktag 
inner join tag on worktag.tagid = tag.tagid
WHERE tag.name='programming'
Gratzy
A: 

I guess the optimizer does some bad guess. Replacing the query with an inner join might help.

erikkallen
A: 

MySQL generally doesn't do so well with subqueries, even independent ones. The posters who discussed joins are right - if you've got a choice, use a join. If you can't easily use a join (ie, foo.x in (select y from bar where y = xxx limit 10)), you're better off running the limit into a temporary IN MEMORY table and using a join on it.

If you're using MySQL a lot, use EXPLAIN and you'll see how it's using your indexes and such things.