views: 650
answers: 8

Hi,

I'm really struggling to get a query time down; it's currently having to scan 2.5 million rows, and it takes over 20 seconds.

Here is the query:

SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;

Here is the table definition:

CREATE TABLE `log` (
  `id` int(11) NOT NULL auto_increment,
  `instance` varchar(255) NOT NULL,
  `email` varchar(255) NOT NULL,
  `type` enum('play','claim','friend','email') NOT NULL,
  `result` enum('win','win-small','lose','none') NOT NULL,
  `timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `play_date` date NOT NULL,
  `email_refer` varchar(255) NOT NULL,
  `remote_addr` varchar(15) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `email` (`email`),
  KEY `result` (`result`),
  KEY `timestamp` (`timestamp`),
  KEY `email_refer` (`email_refer`),
  KEY `type_2` (`type`,`timestamp`),
  KEY `type_4` (`type`,`play_date`),
  KEY `type_result` (`type`,`play_date`,`result`)
);

Here is the EXPLAIN output:

id  select_type  table  type  possible_keys              key     key_len  ref    rows    Extra
1   SIMPLE       log    ref   type_2,type_4,type_result  type_4  1        const  270404  Using where

The query is using the type_4 index.

Does anyone know how I could speed this query up?

Thanks,
Tom

+3  A: 

Try an index on (play_date, type) (the same columns as type_4, just reversed) and see if that helps.

There are 4 possible types, and I assume hundreds of possible dates. If the query uses the (type, play_date) index, it basically works like this (not 100% accurate, but the general idea):

(A) Find all the Play records (about 25% of the file)
(B) Now within that subset, find all of the requested dates

By reversing the index, the approach is:

(A) Find all the dates within range (maybe 1-2% of the file)
(B) Now find all PLAY types within that smaller portion of the file
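As a concrete sketch, the suggested index could be created like this (the name play_date_type matches the one Tom uses in his follow-up):

ALTER TABLE log ADD KEY `play_date_type` (`play_date`, `type`);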

Hope this helps

Sparky
This answer avoids the need for another index but might affect other queries that depend on type and not on play_date. It is a trade-off.
borjab
Hi Sparky, I've created the new index, but MySQL won't use it, so I used FORCE INDEX; the WHERE now seems to return all rows:

id  select_type  table  type   possible_keys   key             key_len  ref   rows     Extra
1   SIMPLE       log    range  play_date_type  play_date_type  4        NULL  2519022  Using where
Tom
A: 

Try creating an index only on play_date.
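As a minimal sketch (the index name is illustrative):

ALTER TABLE log ADD KEY `play_date` (`play_date`);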

borjab
+4  A: 

Your indexing is probably as good as you can get it. You have a compound index on the 2 columns in your WHERE clause, and the EXPLAIN you posted indicates that it is being used. Unfortunately, there are 270,404 rows that match the criteria in your WHERE clause, and they all need to be considered. Also, you're not returning unnecessary columns in your select list.

My advice would be to aggregate the data daily (or hourly or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.
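One possible shape for that cache, as a rough sketch (the table name log_daily_summary is illustrative, and the refresh would be run from a scheduled job):

CREATE TABLE log_daily_summary (
  play_date date NOT NULL,
  distinct_emails int NOT NULL,
  PRIMARY KEY (play_date)
);

-- Re-run periodically (e.g. via cron) to refresh the cached counts
REPLACE INTO log_daily_summary (play_date, distinct_emails)
SELECT play_date, COUNT(DISTINCT email)
FROM log
WHERE type = 'play'
GROUP BY play_date;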

Asaph
Unfortunately, the stats need to be live.
Tom
+3  A: 

Extracting email to a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is used and the query itself is as optimized as it can be (except for the email, of course).
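A sketch of that normalization, assuming a new emails lookup table and an email_id column on log (both names are illustrative):

CREATE TABLE emails (
  id int NOT NULL auto_increment,
  email varchar(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY email (email)
);

-- log would then store `email_id` int NOT NULL instead of the varchar,
-- and the query counts the integer column, which compares much faster:
SELECT play_date AS date, COUNT(DISTINCT email_id) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;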

Eimantas
A: 

Long term, I would recommend building a summary table with a primary key of play_date and count of distinct emails.

Depending on how up to date you need it to be, either allow it to be updated daily (by play_date) or keep it live via a trigger on the log table.
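A sketch of the trigger-based variant (all table and trigger names are illustrative). A truly correct live distinct count needs a helper table recording which (play_date, email) pairs have been seen, so the counter is only bumped the first time an email appears on a date:

CREATE TABLE play_date_emails (
  play_date date NOT NULL,
  email varchar(255) NOT NULL,
  PRIMARY KEY (play_date, email)
);

CREATE TABLE play_date_summary (
  play_date date NOT NULL,
  distinct_emails int NOT NULL DEFAULT 0,
  PRIMARY KEY (play_date)
);

DELIMITER //
CREATE TRIGGER log_after_insert AFTER INSERT ON log
FOR EACH ROW
BEGIN
  IF NEW.type = 'play' THEN
    -- INSERT IGNORE affects 0 rows if the pair was already recorded
    INSERT IGNORE INTO play_date_emails (play_date, email)
      VALUES (NEW.play_date, NEW.email);
    IF ROW_COUNT() > 0 THEN
      INSERT INTO play_date_summary (play_date, distinct_emails)
        VALUES (NEW.play_date, 1)
        ON DUPLICATE KEY UPDATE distinct_emails = distinct_emails + 1;
    END IF;
  END IF;
END//
DELIMITER ;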

Brian Mansell
A: 

There is a good chance a table scan will be quicker than random access to over 200,000 rows:

SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...

Also, for large grouped queries you may see better performance by forcing a file sort rather than a hashtable-based group (since, if this turns out to need more than tmp_table_size or max_heap_table_size, performance collapses):

SELECT SQL_BIG_RESULT ...
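Put together, the full query with both hints might look like this sketch:

SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT email) AS count
FROM log IGNORE INDEX (type_2, type_4, type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;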
Andrew Duffy
+9  A: 

That's already relatively good. The performance sink is that the query has to compare 270,404 varchars for equality for the COUNT(DISTINCT email), meaning that 270,404 rows have to be read.

You should be able to make the count an index-only operation (i.e. the actual rows do not need to be touched, only the index) by changing the index as follows:

KEY `type_4` (`type`,`play_date`, `email`)

I would be surprised if that wouldn't speed things up quite a bit.
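As a sketch, the change could be applied like this:

ALTER TABLE log
  DROP KEY `type_4`,
  ADD KEY `type_4` (`type`, `play_date`, `email`);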

Henning
Just tried this and got a massive improvement; query time is now under 3 seconds, which we can live with. Thank you very much!
Tom
I think that for this particular query, removing type will also help.
Pop Catalin
This is called a "covering index": it happens when all the columns used in the query are in an index. In this case, they also happen to be in the right order, so the GROUP BY can be optimized into a simple range scan (no sorting required) of the covering index. That is a really good type of query plan even when there are quite a lot of rows, because there is good locality of the data.
MarkR
+1  A: 

The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 results out of 270,404, perhaps it would help to do the email count only for those rows instead of for the whole set.

SELECT shortlist.date, COUNT(DISTINCT log.email) AS count
FROM log,
(
    -- the derived table must expose id so the outer query can join back to log
    SELECT id, play_date AS date
      FROM log
     WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
       AND type = 'play'
     ORDER BY play_date DESC
     LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY shortlist.date
Ewan Todd
... except that he says the count() piece only kills him for 8 seconds of the 20.
Ewan Todd