views:

797

answers:

4

I've got a table of keywords that I regularly refresh against a remote search API, and I have another table that gets a row each each time I refresh one of the keywords. I use this table to block multiple processes from stepping on each other and refreshing the same keyword, as well as stat collection. So when I spin up my program, it queries for all the keywords that don't have a request currently in process, and don't have a successful one within the last 15 mins, or whatever the interval is. All was working fine for awhile, but now the keywords_requests table has almost 2 million rows in it and things are bogging down badly. I've got indexes on almost every column in the keywords_requests table, but to no avail.

I'm logging slow queries and this one is taking forever, as you can see. What can I do?

# Query_time: 20 Lock_time: 0 Rows_sent: 568 Rows_examined: 1826718

SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT JOIN `keywords_requests` as KeywordsRequest
ON (
  KeywordsRequest.keyword_id = Keyword.id
  AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
  AND KeywordsRequest.source_id = '29'
  AND KeywordsRequest.created > FROM_UNIXTIME(1234551323)
)
WHERE KeywordsRequest.id IS NULL
GROUP BY Keyword.id
ORDER BY KeywordsRequest.created ASC;
A: 

Dunno about MySQL but in MSSQL the lines of attack I would take are:

1) Create a covering index on KeywordsRequest status, source_id and created

2) UNION the results tog et around the OR on KeywordsRequest.status

3) Use NOT EXISTS instead o the Outer Join (and try with UNION instead of OR too)

Kristen
+1  A: 

When diagnosing MySQL query performance, one of the first things you need to analyze is the report from EXPLAIN.

If you learn to read the information EXPLAIN gives you, then you can see where queries are failing to make use of indexes, or where they are causing expensive filesorts, or other performance red flags.

I notice in your query, the GROUP BY is irrelevant, since there will be only one NULL row returned from KeywordRequests. Also the ORDER BY is irrelevant, since you're ordering by a column that will always be NULL due to your WHERE clause. If you remove these clauses, you'll probably eliminate a filesort.

Also consider rewriting the query into other forms, and measure the performance of each. For example:

SELECT k.id, k.keyword
FROM `keywords` AS k
WHERE NOT EXISTS (
  SELECT * FROM `keywords_requests` AS kr
  WHERE kr.keyword_id = k.id
   AND kr.status IN ('success', 'active')
   AND kr.source_id = '29'
   AND kr.created > FROM_UNIXTIME(1234551323)
);

Other tips:

  • Is kr.source_id an integer? If so, compare to the integer 29 instead of the string '29'.
  • Are there appropriate indexes on keyword_id, status, source_id, created? Perhaps even a compound index over all four columns would be best, since MySQL will use only one index per table in a given query.


You did a screenshot of your EXPLAIN output and posted a link in the comments. I see that the query is not using an index from Keywords, which makes sense since you're scanning every row in that table anyway. The phrase "Not exists" indicates that MySQL has optimized the LEFT OUTER JOIN a bit.

I think this should be improved over your original query. The GROUP BY/ORDER BY was probably causing it to save an intermediate data set as a temporary table, and sorting it on disk (which is very slow!). What you'd look for is "Using temporary; using filesort" in the Extra column of EXPLAIN information.

So you may have improved it enough already to mitigate the bottleneck for now.

I do notice that the possible keys probably indicate that you have individual indexes on four columns. You may be able to improve that by creating a compound index:

CREATE INDEX kr_cover ON keywords_requests
  (keyword_id, created, source_id, status);

You can give MySQL a hint to use a specific index:

... FROM `keywords_requests` AS kr USE INDEX (kr_cover) WHERE ...
Bill Karwin
Wow, thanks for your help. The EXPLAIN for this query wasn't terribly helpful for me. Here's what I got after removing the GROUP BY and ORDER BY clauses...what am I looking for? http://img.skitch.com/20090213-fq5mkucasgdqc3ck9aeca6qr9p.jpg
Also, I got the most speed improvement from using Quassnoi's example below. But if I wanted to speed things up further, could a multi-column index still help me? Would I just want source_id, status, and keyword_id in the index, since created is in a subquery?
It's hard to guess, since the optimizer's choice of indexes depends in part on the distribution of data values in your database. Best advice is to try it a few different ways and measure both the EXPLAIN information as well as the actual time to run the query.
Bill Karwin
+1  A: 

It seems your most selective index on Keywords is one on KeywordRequest.created.

Try to rewrite query this way:

SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
  SELECT *
  FROM `keywords_requests` as kr
  WHERE created > FROM_UNIXTIME(1234567890) /* Happy unix_time! */
) AS KeywordsRequest
ON (
  KeywordsRequest.keyword_id = Keyword.id
  AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
  AND KeywordsRequest.source_id = '29'
)
WHERE keyword_id IS NULL;

It will (hopefully) hash join two not so large sources.

And Bill Karwin is right, you don't need the GROUP BY or ORDER BY

There is no fine control over the plans in MySQL, but you can try (try) to improve your query in the following ways:

  1. Create a composite index on (keyword_id, status, source_id, created) and make it so:

    SELECT Keyword.id, Keyword.keyword
    FROM `keywords` as Keyword
    LEFT OUTER JOIN `keywords_requests` kr
    ON (
      keyword_id = id
      AND status = 'success'
      AND source_id = '29'
      AND created > FROM_UNIXTIME(1234567890)
    )
    WHERE keyword_id IS NULL
    UNION
    SELECT Keyword.id, Keyword.keyword
    FROM `keywords` as Keyword
    LEFT OUTER JOIN `keywords_requests` kr
    ON (
      keyword_id = id
      AND status = 'active'
      AND source_id = '29'
      AND created > FROM_UNIXTIME(1234567890)
    )
    WHERE keyword_id IS NULL
    

    This ideally should use NESTED LOOPS on your index.

  2. Create a composite index on (status, source_id, created) and make it so:

    SELECT Keyword.id, Keyword.keyword
    FROM `keywords` as Keyword
    LEFT OUTER JOIN (
      SELECT *
      FROM `keywords_requests` kr
      WHERE
        status = 'success'
        AND source_id = '29'
        AND created > FROM_UNIXTIME(1234567890)
      UNION ALL
      SELECT *
      FROM `keywords_requests` kr
      WHERE
        status = 'active'
        AND source_id = '29'
        AND created > FROM_UNIXTIME(1234567890)
    )
    ON keyword_id = id
    WHERE keyword_id IS NULL
    

    This will hopefully use HASH JOIN on even more restricted hash table.

Quassnoi
Wow, this cut the query time down to nothing in a few cases, and less than a second in the common case. Very nice. A follow-on question: the query time gets longer the further back I look, so could I optimize further with a multi-column index?
A: 

Try this SELECT Keyword.id, Keyword.keyword FROM keywords as Keyword LEFT JOIN (select * from keywords_requests where source_id = '29' and (status = 'success' OR status = 'active') AND source_id = '29' AND created > FROM_UNIXTIME(1234551323) AND id IS NULL ) as KeywordsRequest ON ( KeywordsRequest.keyword_id = Keyword.id

) GROUP BY Keyword.id ORDER BY KeywordsRequest.created ASC;