I have a MySQL table (articles) with a nested (composite) index on (blog_id, published), and queries against it perform poorly. I see a lot of entries like this in my slow query log:

- Query_time: 23.184007 Lock_time: 0.000063 Rows_sent: 380 Rows_examined: 6341 SELECT id from articles WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) order by published DESC LIMIT 380;

I have trouble understanding why MySQL would run through all the rows with those blog_ids to figure out my top 380 rows. I would expect that the whole purpose of the nested index is to speed this up. At the very least, even a naive implementation should look up each blog_id and get its top 380 rows ordered by published. That should be fast, since the nested index lets us locate exactly those 380 rows. It would then sort the resulting 19*380=7220 rows.
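The naive plan described above can be written out by hand as one small query per blog_id, glued together with UNION ALL. The Python helper below is only a sketch that builds such a statement; the function name `per_blog_top_n_sql` is hypothetical, the category_id filter is left out on the assumption that it is redundant, and the rewrite has not been run against the real table, so whether it is actually faster is untested.

```python
def per_blog_top_n_sql(blog_ids, limit):
    """Build a UNION ALL query: top `limit` rows per blog, then a final sort.

    Each branch filters on a single constant blog_id and orders by published,
    which is the shape the (blog_id, published) index is meant to serve.
    """
    branches = [
        "(SELECT id, published FROM articles "
        f"WHERE blog_id = {int(blog_id)} "          # int() guards against injection
        f"ORDER BY published DESC LIMIT {int(limit)})"
        for blog_id in blog_ids
    ]
    # The outer ORDER BY / LIMIT only has to sort len(blog_ids) * limit rows.
    return ("\nUNION ALL\n".join(branches)
            + f"\nORDER BY published DESC LIMIT {int(limit)}")


sql = per_blog_top_n_sql([13, 14], 380)
print(sql)
```

Running EXPLAIN on the generated statement would show whether each branch avoids the filesort; that check is left to the reader.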

If one were to implement it optimally, you would build a heap over the set of all blog_id-based streams, pick the one with the max(published), and repeat that 380 times. Each operation should be fast.
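For concreteness, here is a minimal Python sketch of that heap idea. It describes the algorithm I have in mind, not anything MySQL is known to do internally. Each stream stands in for one blog_id's index entries, already ordered by published descending; the function name and the sample data are made up for illustration.

```python
import heapq


def top_n_by_published(streams, n):
    """Merge per-blog streams of (published, id), each sorted by published
    descending, and return the ids of the n most recent articles."""
    iters = [iter(s) for s in streams]
    heap = []
    # Seed the heap with the newest article of every stream.
    # heapq is a min-heap, so published is negated to pop the maximum first.
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            published, article_id = first
            heapq.heappush(heap, (-published, idx, article_id))

    result = []
    while heap and len(result) < n:
        _neg_published, idx, article_id = heapq.heappop(heap)
        result.append(article_id)
        # Refill from the stream the popped article came from.
        nxt = next(iters[idx], None)
        if nxt is not None:
            published, next_id = nxt
            heapq.heappush(heap, (-published, idx, next_id))
    return result


# Hypothetical data: three blogs, entries are (published, id).
streams = [
    [(90, 1), (50, 2), (10, 3)],
    [(80, 4), (70, 5)],
    [(60, 6)],
]
print(top_n_by_published(streams, 4))
```

With k streams, each of the n steps costs O(log k), so only about n + k index entries are ever touched.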

I'm surely missing something, since Google, Facebook, Twitter, Microsoft and all the big companies are using MySQL in production. Anyone with experience?

Edit: Updating as per thieger's answer. I tried index hinting, and it doesn't seem to help. Results are attached below, at the end. The MySQL ORDER BY optimization documentation claims to address the concern thieger is raising:

I agree that MySQL might possibly use the composite blog_id-published-index, but only for the blog_id part of the query.

SELECT * FROM t1 WHERE key_part1=constant ORDER BY key_part2;

At least MySQL seems to claim the index can be used beyond just the WHERE clause (the blog_id part of the query). Any help, thieger?

Thanks, -Prasanna [myprasanna at gmail dot com]

CREATE TABLE IF NOT EXISTS `articles` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `category_id` int(11) DEFAULT NULL,
  `blog_id` int(11) DEFAULT NULL,
  `cluster_id` int(11) DEFAULT NULL,
  `title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `description` text COLLATE utf8_unicode_ci,
  `keywords` text COLLATE utf8_unicode_ci,
  `image_url` varchar(511) COLLATE utf8_unicode_ci DEFAULT NULL,
  `url` varchar(511) COLLATE utf8_unicode_ci DEFAULT NULL,
  `url_hash` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
  `author` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `categories` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
  `published` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `is_image_crawled` tinyint(1) DEFAULT NULL,
  `image_candidates` text COLLATE utf8_unicode_ci,
  `title_hash` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
  `article_readability_crawled` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index_articles_on_url_hash` (`url_hash`),
  KEY `index_articles_on_cluster_id` (`cluster_id`),
  KEY `index_articles_on_published` (`published`),
  KEY `index_articles_on_is_image_crawled` (`is_image_crawled`),
  KEY `index_articles_on_category_id` (`category_id`),
  KEY `index_articles_on_title_hash` (`title_hash`),
  KEY `index_articles_on_article_readability_crawled` (`article_readability_crawled`),
  KEY `index_articles_on_blog_id` (`blog_id`,`published`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=562907 ;

SELECT id from articles USE INDEX(index_articles_on_blog_id) WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) order by published DESC LIMIT 380;

....
380 rows in set (11.27 sec)

explain SELECT id from articles USE INDEX(index_articles_on_blog_id) WHERE category_id = 11 AND blog_id IN (13,14,15,16,17,18,19,20,21,22,23,24,26,27,6330,6331,8269,12218,18889) order by published DESC LIMIT 380\G;
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: articles
         type: range
possible_keys: index_articles_on_blog_id
          key: index_articles_on_blog_id
      key_len: 5
          ref: NULL
         rows: 8640
        Extra: Using where; Using filesort
1 row in set (0.00 sec)
+2  A: 

Did you try EXPLAIN to see whether your index is used at all? Did you ANALYZE to update the index statistics?

I agree that MySQL might possibly use the composite blog_id-published-index, but only for the blog_id part of the query. If the index is not used after ANALYZE, you can try giving MySQL a hint with USE INDEX or even FORCE INDEX, but the MySQL optimizer may also correctly assume that a sequential scan is faster than using the index. For your kind of query, I would also propose to add an index on category_id and blog_id and try to use that.

thieger
Also, I forgot to mention that blog_id has a unique association with category_id, so the category_id = xxx part of the query can be removed. So it does not seem to make sense to include category_id in any index.
Prasanna
Also updated the question with edits. Please take a look. Thanks for the response.
Prasanna
As to the "unique association with category_id", I'm not sure what you mean, but if MySQL doesn't know about it, it doesn't matter anyway. As to ORDER BY using the index, think again: if the index is ordered by blog_id, and then by published, and you ask MySQL to select ranges of records with several blog_ids, then the result cannot already be ordered by published, so it has to be sorted again. But I'm also puzzled by your EXPLAIN output, with MySQL claiming that it uses the index but still considers all records -- or are there more than 8000 records? In the first output there were only about 6000.
thieger
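The ordering argument above can be checked with a tiny Python model of the index. This is only an illustration: the index is modelled as a sorted list of (blog_id, published, id) tuples and the values are invented.

```python
# Model of the (blog_id, published) index: entries sorted by blog_id,
# then by published. The third element is the row id. Data is made up.
index_entries = sorted([
    (13, 100, 1),
    (13, 300, 2),
    (14, 200, 3),
    (14, 400, 4),
])

# Range scan for blog_id IN (13, 14): entries come back in index order.
scan = [entry for entry in index_entries if entry[0] in {13, 14}]
published_in_scan_order = [published for _blog, published, _id in scan]

# A single blog_id, by contrast, yields published values already in order.
single_blog = [published for blog, published, _id in index_entries if blog == 13]

print(published_in_scan_order)                                     # not globally sorted
print(published_in_scan_order == sorted(published_in_scan_order))  # False
print(single_blog == sorted(single_blog))                          # True
```

The scan over two blog_ids interleaves the published values, which is why an extra sort is needed, while the scan over one blog_id is already ordered.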
Just to see the extreme case of this: if I do LIMIT 1, would MySQL fetch all the thousands of rows and sort them? When you intersect with blog_id, the extra information you have is the ordering of published. But it seems MySQL is not using that. Anyway, I'll mark this question as answered. Thanks, cheers.
Prasanna
+1  A: 

Aside from thieger's excellent answer, you might also want to check:

  • if an index on (category_id,blog_id,published) is any use.
  • if there is enough room to keep all indexes in memory (innodb buffer pool usage & flushes for instance, mysqlreport is a very handy tool in that respect)
Wrikken
Also, I forgot to mention that blog_id has a unique association with category_id, so the category_id = xxx part of the query can be removed. So it does not seem to make sense to include category_id in any index. I have updated the question, please take a look. Thanks.
Prasanna
So, what does the query do without the category_id? And how is your innodb key status?
Wrikken
A: 

MySQL has a cutoff mechanism where if it detects that it will probably have to look at more than about a third of the table anyway, it won't use the index. Since it appears your query will match just over 6000 rows of an 8000-odd row table, that is definitely what is happening.

In addition, MySQL can't usually use an index twice on the same table, nor can it use more than one. In this case, it won't use the index for the ORDER BY clause because it has different columns specified than in the WHERE clause.

staticsan
As can be seen from Prasanna's edit, MySQL in fact uses the index. (And apart from him saying that MySQL examines all rows -- about the 6000 -- we don't know how many rows the table has). And, as Prasanna has correctly pointed out, there are cases where an index can be used for both the where and the order by part. It just seems this is not such a query, probably because "in (...)" is not a constant in the sense required here.
thieger
Ah... yes, you're right: it is using the index. I also misread the 'rows' column in the Explain. And I actually agreed with Prasanna's link about the index. As his query stands, MySQL won't use the index in the `ORDER BY` clause. I may have not been as clear as possible on this. IME, most people with this sort of problem take a while to realize the `ORDER BY` needs to reference the same columns as in the `WHERE` clause for it to use the index for sorting.
staticsan