I am having trouble optimizing a relatively simple query involving a GROUP BY, ORDER BY, and LIMIT. The table has just over 300,000 records. Here's the schema (I added some extra indexes to experiment with):

CREATE TABLE `scrape_search_results` (
  `id` int(11) NOT NULL auto_increment,
  `creative_id` int(11) NOT NULL,
  `url_id` int(11) NOT NULL,
  `access_date` datetime NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `creative_url_index` (`creative_id`,`url_id`),
  KEY `access_date_index` (`access_date`),
  KEY `access_date_creative_id_index` (`access_date`,`creative_id`),
  KEY `creative_id_access_date_index` (`creative_id`,`access_date`),
  KEY `test_index` USING HASH (`creative_id`)
) ENGINE=MyISAM AUTO_INCREMENT=4252725 DEFAULT CHARSET=latin1

In the table, a single creative_id may appear many (often hundreds of) times. The query I am trying to answer is a relatively simple one: give me the first 20 creative_ids ordered by their latest access_date. Here's my SQL:

SELECT `ScrapeSearchResult`.`creative_id`, 
        MAX(`ScrapeSearchResult`.`access_date`) AS `latest_access_date` 
FROM `scrape_search_results` AS `ScrapeSearchResult` 
WHERE 1 = 1 
GROUP BY `ScrapeSearchResult`.`creative_id` 
ORDER BY `latest_access_date` DESC 
LIMIT 20;

Here are the results of executing this query; note that the 20th-largest latest_access_date is 2010-08-23 11:03:25:

+-------------+---------------------+
| creative_id | latest_access_date  |
+-------------+---------------------+
|         550 | 2010-08-23 11:07:49 | 
|        4568 | 2010-08-23 11:07:49 | 
|         552 | 2010-08-23 11:07:49 | 
|        2109 | 2010-08-23 11:07:49 | 
|        5221 | 2010-08-23 11:07:49 | 
|        1544 | 2010-08-23 11:07:49 | 
|        1697 | 2010-08-23 11:07:49 | 
|         554 | 2010-08-23 11:07:12 | 
|         932 | 2010-08-23 11:05:48 | 
|       11029 | 2010-08-23 11:05:37 | 
|       11854 | 2010-08-23 11:05:27 | 
|       11856 | 2010-08-23 11:05:05 | 
|         702 | 2010-08-23 11:03:56 | 
|        4319 | 2010-08-23 11:03:56 | 
|        7159 | 2010-08-23 11:03:56 | 
|       10610 | 2010-08-23 11:03:46 | 
|        5540 | 2010-08-23 11:03:46 | 
|           1 | 2010-08-23 11:03:46 | 
|       11942 | 2010-08-23 11:03:35 | 
|        7900 | 2010-08-23 11:03:25 | 
+-------------+---------------------+

If I were going to write this algorithm by hand, I would build a B-tree ordered on (access_date, creative_id). I'd start at MAX(access_date) and keep walking the tree until I had found 20 unique creative_ids, which I would then return in the order I found them.

Using that algorithm, I would need to consider just 94 rows (there are 94 rows for which access_date >= 2010-08-23 11:03:25, which is our 20th largest access_date as shown above).
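That walk can be approximated in plain SQL with a greatest-per-group anti-join (a sketch only; I have not benchmarked it against this table): a row survives only if no newer row exists for the same creative_id, and the inner probe can be answered entirely from creative_id_access_date_index.

```sql
-- Sketch: keep each creative_id's newest row, then sort and limit.
-- Caveat: two rows with the same (creative_id, access_date) would
-- both survive the NOT EXISTS and appear as duplicates.
SELECT s.creative_id, s.access_date AS latest_access_date
FROM scrape_search_results AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM scrape_search_results AS newer
    WHERE newer.creative_id = s.creative_id
      AND newer.access_date > s.access_date
)
ORDER BY s.access_date DESC
LIMIT 20;
```

Whether the optimizer actually stops early after 20 distinct ids, rather than materializing all survivors first, would need to be verified with EXPLAIN on this MySQL version.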

However, MySQL decides to use creative_url_index to answer this query, which I don't understand; it examines over 10,000 rows in the process.

ANALYZE TABLE scrape_search_results;
SELECT ...;
+----+-------------+--------------------+-------+---------------+--------------------+---------+------+-------+---------------------------------+
| id | select_type | table              | type  | possible_keys | key                | key_len | ref  | rows  | Extra                           |
+----+-------------+--------------------+-------+---------------+--------------------+---------+------+-------+---------------------------------+
|  1 | SIMPLE      | ScrapeSearchResult | index | NULL          | creative_url_index | 8       | NULL | 10687 | Using temporary; Using filesort | 
+----+-------------+--------------------+-------+---------------+--------------------+---------+------+-------+---------------------------------+

Is my trouble that I am performing an ORDER BY on the derived-column MAX(access_date)? If so, how can I optimize my query to perform more in-line with my expectations?

+2  A: 

I haven't done this sort of thing in MySQL for a while (I long since switched to PostgreSQL), but typically I would handle this with nested selects to trick the query planner into producing a good plan.

SELECT * FROM 
(SELECT `ScrapeSearchResult`.`creative_id`, 
        MAX(`ScrapeSearchResult`.`access_date`) AS `latest_access_date` 
FROM `scrape_search_results` AS `ScrapeSearchResult` 
WHERE 1 = 1 
GROUP BY `ScrapeSearchResult`.`creative_id`
) AS inner_query
ORDER BY `latest_access_date` DESC 
LIMIT 20;

The success of this will depend purely on the inner query producing a reasonable number of rows, though.

I just looked up the docs for MySQL 5.6 and it looks like this should work ... even in MySQL ;)

Trey
@Trey: Excellent, your suggestion brought the time down from 0.44 sec to 0.17 sec. The new query plan built a temporary table and then queried it. The `creative_id_access_date_index` index was used (still examining 10,687 rows, but with "Using index for group-by" instead of a temporary table and filesort on the derived step), which I believe accounts for the speedup. Any other suggestions to bring down the number of rows considered, for an even greater boost in speed?
Rob Crowell
Didn't want to mess up the comment above; here's the new EXPLAIN output:
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 10812 | Using filesort
2 | DERIVED | ScrapeSearchResult | range | NULL | creative_id_access_date_index | 4 | NULL | 10687 | Using index for group-by
Rob Crowell
As long as you have no WHERE condition, it's going to do a full scan. If you can put any sort of condition on the query at all, it will help considerably. My query only speeds up the sort; it doesn't do much to reduce the number of rows scanned.
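For example (a hypothetical restriction; the 7-day window is an assumption you'd tune to how fresh your results need to be), a date cutoff lets MySQL range-scan access_date_index instead of touching every row:

```sql
-- Hypothetical: only consider recent scrapes, so the WHERE clause
-- can be answered with a range scan on access_date_index.
SELECT creative_id, MAX(access_date) AS latest_access_date
FROM scrape_search_results
WHERE access_date > NOW() - INTERVAL 7 DAY
GROUP BY creative_id
ORDER BY latest_access_date DESC
LIMIT 20;
```

This only works if every creative_id you care about is guaranteed to have activity inside the window; otherwise recently inactive ids silently drop out of the results.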
Trey
I don't know how often you are adding/pruning data, but you might want to build a summary table (holding just the results of the inner query) and select from that. You can maintain it in application code or with INSERT/UPDATE triggers. That way all sorting is done with an appropriate index (if you add one).
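A sketch of that summary-table approach (all names here are made up, and this only covers inserts; deletes from the base table would need separate handling):

```sql
-- One row per creative_id, indexed for the final sort.
CREATE TABLE creative_latest_access (
    creative_id int(11) NOT NULL,
    latest_access_date datetime NOT NULL,
    PRIMARY KEY (creative_id),
    KEY latest_access_date_index (latest_access_date)
) ENGINE=MyISAM;

-- Keep it current as scrape results arrive.
CREATE TRIGGER scrape_search_results_ai
AFTER INSERT ON scrape_search_results
FOR EACH ROW
    INSERT INTO creative_latest_access (creative_id, latest_access_date)
    VALUES (NEW.creative_id, NEW.access_date)
    ON DUPLICATE KEY UPDATE
        latest_access_date = GREATEST(latest_access_date, NEW.access_date);

-- The original question then becomes a pure index read:
SELECT creative_id, latest_access_date
FROM creative_latest_access
ORDER BY latest_access_date DESC
LIMIT 20;
```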
Trey
In MySQL, GROUP BY also implies an ORDER BY on the same columns. You may want to eliminate that sort by modifying the inner query this way: `GROUP BY ScrapeSearchResult.creative_id ORDER BY NULL`
ceteras
@ceteras: Interesting, didn't know this about MySQL at all. It made about 0.9 s difference when I explicitly ordered by NULL. Thanks!
Rob Crowell