I am having trouble optimizing a relatively simple query involving a GROUP BY, ORDER BY, and LIMIT. The table has just over 300,000 records. Here's the schema (I added some extra indexes to experiment with):
CREATE TABLE `scrape_search_results` (
`id` int(11) NOT NULL auto_increment,
`creative_id` int(11) NOT NULL,
`url_id` int(11) NOT NULL,
`access_date` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `creative_url_index` (`creative_id`,`url_id`),
KEY `access_date_index` (`access_date`),
KEY `access_date_creative_id_index` (`access_date`,`creative_id`),
KEY `creative_id_access_date_index` (`creative_id`,`access_date`),
KEY `test_index` USING HASH (`creative_id`)
) ENGINE=MyISAM AUTO_INCREMENT=4252725 DEFAULT CHARSET=latin1
In the table, a single creative_id may appear many (hundreds of) times. The query I am trying to answer is relatively simple: give me the first 20 creative_ids ordered by access_date. Here's my SQL:
SELECT `ScrapeSearchResult`.`creative_id`,
MAX(`ScrapeSearchResult`.`access_date`) AS `latest_access_date`
FROM `scrape_search_results` AS `ScrapeSearchResult`
WHERE 1 = 1
GROUP BY `ScrapeSearchResult`.`creative_id`
ORDER BY `latest_access_date` DESC
LIMIT 20;
Here are the results of executing this query, where we see that the 20th-largest access_date is 2010-08-23 11:03:25:
+-------------+---------------------+
| creative_id | latest_access_date  |
+-------------+---------------------+
|         550 | 2010-08-23 11:07:49 |
|        4568 | 2010-08-23 11:07:49 |
|         552 | 2010-08-23 11:07:49 |
|        2109 | 2010-08-23 11:07:49 |
|        5221 | 2010-08-23 11:07:49 |
|        1544 | 2010-08-23 11:07:49 |
|        1697 | 2010-08-23 11:07:49 |
|         554 | 2010-08-23 11:07:12 |
|         932 | 2010-08-23 11:05:48 |
|       11029 | 2010-08-23 11:05:37 |
|       11854 | 2010-08-23 11:05:27 |
|       11856 | 2010-08-23 11:05:05 |
|         702 | 2010-08-23 11:03:56 |
|        4319 | 2010-08-23 11:03:56 |
|        7159 | 2010-08-23 11:03:56 |
|       10610 | 2010-08-23 11:03:46 |
|        5540 | 2010-08-23 11:03:46 |
|           1 | 2010-08-23 11:03:46 |
|       11942 | 2010-08-23 11:03:35 |
|        7900 | 2010-08-23 11:03:25 |
+-------------+---------------------+
If I were going to write this algorithm by hand, I would build a b-tree ordered on (access_date, creative_id). I'd start at MAX(access_date) and keep walking the tree until I had found 20 unique creative_ids, which I would then return in the order I found them.
Using that algorithm, I would only need to consider 94 rows: there are 94 rows for which access_date >= 2010-08-23 11:03:25, which is our 20th-largest access_date as shown above.
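For illustration only, here is roughly what that restricted scan would look like in SQL. (This cheats: the cutoff date is lifted from the result set above, so a real query wouldn't know it in advance.)
SELECT COUNT(*)
FROM scrape_search_results
WHERE access_date >= '2010-08-23 11:03:25';   -- the 94 rows mentioned above

SELECT creative_id, MAX(access_date) AS latest_access_date
FROM scrape_search_results
WHERE access_date >= '2010-08-23 11:03:25'    -- 20th-largest access_date from the results
GROUP BY creative_id
ORDER BY latest_access_date DESC
LIMIT 20;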
However, MySQL decides to use creative_url_index when answering this query, which I don't understand. It considers over 10,000 rows when doing this.
ANALYZE TABLE scrape_search_results;
SELECT ...;
+----+-------------+--------------------+-------+---------------+--------------------+---------+------+-------+---------------------------------+
| id | select_type | table              | type  | possible_keys | key                | key_len | ref  | rows  | Extra                           |
+----+-------------+--------------------+-------+---------------+--------------------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | ScrapeSearchResult | index | NULL          | creative_url_index | 8       | NULL | 10687 | Using temporary; Using filesort |
+----+-------------+--------------------+-------+---------------+--------------------+---------+------+-------+---------------------------------+
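For what it's worth, I know I can hint the optimizer toward one of the other indexes; the sketch below is the kind of thing I could experiment with, though I have not confirmed whether it actually changes the plan:
SELECT `ScrapeSearchResult`.`creative_id`,
       MAX(`ScrapeSearchResult`.`access_date`) AS `latest_access_date`
FROM `scrape_search_results` AS `ScrapeSearchResult`
     FORCE INDEX (`access_date_creative_id_index`)
WHERE 1 = 1
GROUP BY `ScrapeSearchResult`.`creative_id`
ORDER BY `latest_access_date` DESC
LIMIT 20;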
Is my trouble that I am performing an ORDER BY on the derived column MAX(access_date)? If so, how can I optimize my query to perform more in line with my expectations?