views: 650
answers: 8

Hi,

I'm really struggling to get a query time down; it's currently having to scan 2.5 million rows, and it takes over 20 seconds.

Here is the query:

SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;

Here is the table definition:

CREATE TABLE `log` (
  `id` int(11) NOT NULL auto_increment,
  `instance` varchar(255) NOT NULL,
  `email` varchar(255) NOT NULL,
  `type` enum('play','claim','friend','email') NOT NULL,
  `result` enum('win','win-small','lose','none') NOT NULL,
  `timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `play_date` date NOT NULL,
  `email_refer` varchar(255) NOT NULL,
  `remote_addr` varchar(15) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `email` (`email`),
  KEY `result` (`result`),
  KEY `timestamp` (`timestamp`),
  KEY `email_refer` (`email_refer`),
  KEY `type_2` (`type`,`timestamp`),
  KEY `type_4` (`type`,`play_date`),
  KEY `type_result` (`type`,`play_date`,`result`)
);

Here is the EXPLAIN output:

id  select_type  table  type  possible_keys              key     key_len  ref    rows    Extra
1   SIMPLE       log    ref   type_2,type_4,type_result  type_4  1        const  270404  Using where

The query is using the type_4 index.

Does anyone know how I could speed this query up?

Thanks,
Tom

+3  A: 

Try an index on (play_date, type) (the same columns as type_4, just reversed) and see if that helps.

There are 4 possible types, and I assume hundreds of possible dates. If the query uses the (type, play_date) index, it basically works like this (not 100% accurate, but the general idea):

(A) Find all the Play records (about 25% of the file)
(B) Now within that subset, find all of the requested dates

By reversing the index, the approach is:

(A) Find all the dates within range (maybe 1-2% of the file)
(B) Now find all PLAY types within that smaller portion of the file
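As a concrete sketch, the suggested index could be created like this (the name play_date_type matches the one Tom uses in his follow-up):

ALTER TABLE log ADD KEY `play_date_type` (`play_date`, `type`);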

Hope this helps

Sparky
This answer avoids the need for another index but might affect other queries that depend on type and not on play_date. It is a trade-off.
borjab
Hi Sparky, I've created the new index, but MySQL won't use it, so I used FORCE INDEX; the WHERE now seems to return all rows:

id  select_type  table  type   possible_keys   key             key_len  ref   rows     Extra
1   SIMPLE       log    range  play_date_type  play_date_type  4        NULL  2519022  Using where
Tom
A: 

Try creating an index only on play_date.
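As a minimal sketch (the index name is illustrative):

ALTER TABLE log ADD KEY `play_date` (`play_date`);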

borjab
+4  A: 

Your indexing is probably as good as you can get it. You have a compound index on the 2 columns in your WHERE clause, and the EXPLAIN you posted indicates that it is being used. Unfortunately, there are 270,404 rows that match the criteria in your WHERE clause, and they all need to be considered. Also, you're not returning unnecessary columns in your select list.

My advice would be to aggregate the data daily (or hourly or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.
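One possible shape for that cache, as a rough sketch (the table name log_daily_summary is illustrative, and the refresh would be run from a scheduled job):

CREATE TABLE log_daily_summary (
  play_date date NOT NULL,
  distinct_emails int NOT NULL,
  PRIMARY KEY (play_date)
);

-- Re-run periodically (e.g. via cron) to refresh the cached counts
REPLACE INTO log_daily_summary (play_date, distinct_emails)
SELECT play_date, COUNT(DISTINCT email)
FROM log
WHERE type = 'play'
GROUP BY play_date;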

Asaph
Unfortunately, the stats need to be live.
Tom
+3  A: 

Extracting email to a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is used and the query itself is as optimized as it can be (except for the email, of course).
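A sketch of that normalization, assuming a new emails lookup table and an email_id column on log (both names are illustrative):

CREATE TABLE emails (
  id int NOT NULL auto_increment,
  email varchar(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY email (email)
);

-- log would then store `email_id` int NOT NULL instead of the varchar,
-- and the query counts the integer column, which compares much faster:
SELECT play_date AS date, COUNT(DISTINCT email_id) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;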

Eimantas
A: 

Long term, I would recommend building a summary table with a primary key of play_date and count of distinct emails.

Depending on how up to date you need it to be, either allow it to be updated daily (by play_date) or keep it live via a trigger on the log table.
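A sketch of the trigger-based variant (all table and trigger names are illustrative). A truly correct live distinct count needs a helper table recording which (play_date, email) pairs have been seen, so the counter is only bumped the first time an email appears on a date:

CREATE TABLE play_date_emails (
  play_date date NOT NULL,
  email varchar(255) NOT NULL,
  PRIMARY KEY (play_date, email)
);

CREATE TABLE play_date_summary (
  play_date date NOT NULL,
  distinct_emails int NOT NULL DEFAULT 0,
  PRIMARY KEY (play_date)
);

DELIMITER //
CREATE TRIGGER log_after_insert AFTER INSERT ON log
FOR EACH ROW
BEGIN
  IF NEW.type = 'play' THEN
    -- INSERT IGNORE affects 0 rows if the pair was already recorded
    INSERT IGNORE INTO play_date_emails (play_date, email)
      VALUES (NEW.play_date, NEW.email);
    IF ROW_COUNT() > 0 THEN
      INSERT INTO play_date_summary (play_date, distinct_emails)
        VALUES (NEW.play_date, 1)
        ON DUPLICATE KEY UPDATE distinct_emails = distinct_emails + 1;
    END IF;
  END IF;
END//
DELIMITER ;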

Brian Mansell
A: 

There is a good chance a table scan will be quicker than random access to over 200,000 rows:

SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...

Also, for large grouped queries you may see better performance by forcing a file sort rather than a hashtable-based group (since, if this turns out to need more than tmp_table_size or max_heap_table_size, performance collapses):

SELECT SQL_BIG_RESULT ...
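Put together, the full query with both hints might look like this sketch:

SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT email) AS count
FROM log IGNORE INDEX (type_2, type_4, type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;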
Andrew Duffy
+9  A: 

That's already relatively good. The performance sink is that the query has to compare 270,404 varchars for equality for the COUNT(DISTINCT email), meaning that 270,404 rows have to be read.

You should be able to make the count an index-only operation (i.e. the actual rows do not need to be touched, only the index) by changing the index as follows:

KEY `type_4` (`type`,`play_date`, `email`)

I would be surprised if that wouldn't speed things up quite a bit.
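As a sketch, the change could be applied like this:

ALTER TABLE log
  DROP KEY `type_4`,
  ADD KEY `type_4` (`type`, `play_date`, `email`);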

Henning
Just tried this and got a massive improvement; query time is now under 3 seconds, which we can live with. Thank you very much!
Tom
I think that for this particular query, removing type will also help.
Pop Catalin
This is called a "covering index": it happens when all the columns used in the query are in an index. In this case, they also happen to be in the right order, so the GROUP BY can be optimized into a simple range scan (no sorting required) of the covering index. That is a really good type of query plan even when there are quite a lot of rows, because there is good locality of the data.
MarkR
+1  A: 

The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 results out of 270,404, perhaps it would help to do the email count only for those rows instead of for the whole set.

SELECT shortlist.date, COUNT(DISTINCT log.email) AS count
FROM log,
(
    -- the derived table must expose id so the outer query can join back to log
    SELECT id, play_date AS date
      FROM log
     WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
       AND type = 'play'
     ORDER BY play_date DESC
     LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY shortlist.date
Ewan Todd
... except that he says the count() piece only kills him for 8 seconds of the 20.
Ewan Todd