I am trying to better understand why this query optimization is so significant (over 100 times faster) so I can reuse similar logic for other queries.

Using MySQL 4.1. RESET QUERY CACHE and FLUSH TABLES were run before every query, and the timings can be reproduced consistently. The only thing that is obvious to me in the EXPLAIN output is that only 5 rows have to be found during the JOIN. But is that the whole answer to the speed difference? Both queries use a partial index (forum_stickies) to determine deleted-topic status (topic_status=0).

Screenshots for deeper analysis with EXPLAIN

slow query: 0.7+ seconds (cache cleared)

SELECT SQL_NO_CACHE forum_id, topic_id FROM bb_topics
WHERE topic_last_post_id IN
(SELECT SQL_NO_CACHE MAX(topic_last_post_id) AS topic_last_post_id
FROM bb_topics WHERE topic_status=0 GROUP BY forum_id)

fast query: 0.004 seconds or less (cache cleared)

SELECT SQL_NO_CACHE forum_id, topic_id FROM bb_topics AS s1 
JOIN 
(SELECT SQL_NO_CACHE MAX(topic_last_post_id) AS topic_last_post_id
FROM bb_topics WHERE topic_status=0 GROUP BY forum_id) AS s2 
ON s1.topic_last_post_id=s2.topic_last_post_id

Note there is no index on the most important column (topic_last_post_id) but that cannot be helped (results are stored for repeated use anyway).
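As a sanity check that the two queries above are interchangeable, here is a minimal sketch that runs both against a tiny in-memory table. Assumptions: it uses Python's sqlite3 rather than MySQL 4.1, the table layout and rows are made up (the real schema is not shown here), and SQL_NO_CACHE is dropped because SQLite does not support it. It only compares result sets; it says nothing about MySQL timings.

```python
import sqlite3

# Hypothetical miniature of bb_topics; the real schema and data are not shown in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bb_topics ("
    " topic_id INTEGER PRIMARY KEY,"
    " forum_id INTEGER,"
    " topic_last_post_id INTEGER,"
    " topic_status INTEGER)"
)
rows = [
    # (topic_id, forum_id, topic_last_post_id, topic_status)
    (1, 1, 10, 0),
    (2, 1, 14, 0),
    (3, 1, 19, 1),  # topic_status != 0, so the subquery must ignore it
    (4, 2, 12, 0),
    (5, 2, 17, 0),
    (6, 3, 11, 0),
]
conn.executemany("INSERT INTO bb_topics VALUES (?, ?, ?, ?)", rows)

# Same shape as the slow query: IN (subquery)
IN_QUERY = """
SELECT forum_id, topic_id FROM bb_topics
WHERE topic_last_post_id IN
  (SELECT MAX(topic_last_post_id) FROM bb_topics
   WHERE topic_status = 0 GROUP BY forum_id)
"""

# Same shape as the fast query: JOIN against a derived table
JOIN_QUERY = """
SELECT s1.forum_id, s1.topic_id FROM bb_topics AS s1
JOIN
  (SELECT MAX(topic_last_post_id) AS topic_last_post_id FROM bb_topics
   WHERE topic_status = 0 GROUP BY forum_id) AS s2
ON s1.topic_last_post_id = s2.topic_last_post_id
"""

in_result = sorted(conn.execute(IN_QUERY).fetchall())
join_result = sorted(conn.execute(JOIN_QUERY).fetchall())
print(in_result)
print(join_result)
```

Both lists come out identical for this data, so any speed difference has to come from how the server executes each form, not from what it returns.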

Is the answer simply that the first query has to scan topic_last_post_id TWICE, the second time to match up the results to the subquery? If so, why is it so disproportionately (it feels exponentially) slower?

(Less important: I am curious why the first query still takes so long even if I do put an index on topic_last_post_id.)

Update: after much more searching I found this Stack Overflow thread, which goes into this topic: http://stackoverflow.com/questions/141278/subqueries-vs-joins

+2  A: 

Maybe the engine executes the subquery for every row in bb_topics, just to see if it finds the topic_last_post_id in the results. Would be stupid, but would also explain the huge difference.

ammoQ
Wow, that might be possible. I had only considered that it might run the query for each of the ids in the grouped results (5 of them), but now that you mention it, I wonder if it does it for all 209 (or even worse, 293) rows. I sent a request to someone to try the queries on a much larger dataset (10,000 rows vs 300) so I can see whether the problem gets even more magnified, which would prove the theory.
_ck_
It just occurred to me to also try this simple query `SELECT SQL_NO_CACHE forum_id, topic_id FROM bb_topics WHERE topic_last_post_id IN(1516,1567,1572,1569,1578)` and it's extremely fast. So you are right, it's executing the subquery for every single row. Wow, that's crazy.
_ck_
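To put rough numbers on the "subquery runs for every row" theory discussed above, here is a small Python model of the two strategies. Assumptions: the 293 rows come from the comment above, while the 5 forums, the post ids and the helper names (`latest_post_ids`, `per_row_strategy`, `run_once_strategy`) are invented for illustration. It counts row visits only and is not how MySQL is actually implemented.

```python
from collections import defaultdict


def latest_post_ids(topics, counter):
    """Stand-in for the GROUP BY subquery: max topic_last_post_id per forum."""
    counter["subquery_runs"] += 1
    best = {}
    for forum_id, _topic_id, last_post_id, status in topics:
        counter["rows_scanned"] += 1
        if status == 0:
            best[forum_id] = max(best.get(forum_id, last_post_id), last_post_id)
    return set(best.values())


def per_row_strategy(topics):
    """Re-run the subquery for every outer row (the suspected slow plan)."""
    counter = defaultdict(int)
    result = []
    for forum_id, topic_id, last_post_id, _status in topics:
        counter["rows_scanned"] += 1
        if last_post_id in latest_post_ids(topics, counter):
            result.append((forum_id, topic_id))
    return result, counter


def run_once_strategy(topics):
    """Run the subquery once, then match rows against its small result."""
    counter = defaultdict(int)
    wanted = latest_post_ids(topics, counter)
    result = []
    for forum_id, topic_id, last_post_id, _status in topics:
        counter["rows_scanned"] += 1
        if last_post_id in wanted:
            result.append((forum_id, topic_id))
    return result, counter


# Hypothetical data: 293 topics spread over 5 forums, none deleted.
topics = [(i % 5 + 1, i, 1000 + i, 0) for i in range(1, 294)]

slow_result, slow_counts = per_row_strategy(topics)
fast_result, fast_counts = run_once_strategy(topics)

print("per-row :", dict(slow_counts))
print("run-once:", dict(fast_counts))
```

In this model the per-row plan grows with the square of the table size (n outer rows times an n-row subquery scan), while the run-once plan grows linearly. That is quadratic rather than exponential, but it is already enough to produce a gap of more than 100x at a few hundred rows, and it would widen further on the 10,000-row dataset.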
A: 

I would say that since the argument inside the IN () clause can be whatever you stick in there, the DB has to check everything that is returned. When you join tables, many performance-enhancing tactics are employed; for instance, it probably uses indexes to its advantage.

CLR