ansaurus

Question

MySQL query with dependent subquery takes too long

Answer 1

A:

I don't know if it is faster, but try this one:

SELECT
  MIN( `quantities`.`start_timestamp` ) AS `start`,
  MAX( `quantities`.`end_timestamp` ) AS `end`,
  ( `quantities`.`quantity` * AVG (`prices`.`price`) * COUNT (`prices`.`price`)) AS `total`
FROM `quantities`
LEFT JOIN `prices`
  ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
  AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
  AND `prices`.`type_id` = 1
GROUP BY HOUR(  `quantities`.`start_timestamp` );

Also compare the results, because the logic is a little different.

I don't do SUM(quantity * AVG(price))

I do AVG(price) * COUNT(price) * quantity

JochenJung 2010-07-22 10:40:13

Thanks JochenJung but I get ERROR 1111 (HY000): Invalid use of group function

neilcrookes 2010-07-22 10:42:23

I forgot a closing bracket. Please try again.

JochenJung 2010-07-22 10:47:08

I now get ERROR 1305 (42000): FUNCTION COUNT does not exist.... weird!

neilcrookes 2010-07-22 10:48:18

`Invalid use of group function` -- the `AVG()` inside the `SUM()` doesn't work

gnarf 2010-07-22 10:49:12

I removed the space between COUNT and the brace and query now runs, but it still takes 5.2 secs. Thanks for trying though.

neilcrookes 2010-07-22 10:51:57

I also get different results to my query

neilcrookes 2010-07-22 10:54:07

@neilcrookes, JochenJung - if you're interested in why COUNT was treated as a stored function, check this question: http://stackoverflow.com/questions/2476307/how-does-mysql-define-distinct-in-reference-documentation (it was my first question on SO :) ). There may be a shorter explanation here: http://dev.mysql.com/doc/refman/5.1/en/function-resolution.html (see the IGNORE_SPACE SQL mode)
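
A minimal sketch of the parsing behaviour described on that manual page, using the `prices` table from this thread. It assumes the session starts with the default SQL mode; note that enabling IGNORE_SPACE also makes the built-in function names reserved words, so it is not a free change.

```sql
-- Default sql_mode: COUNT is only recognised as the built-in aggregate
-- when "(" follows the name with no whitespace in between.
SELECT COUNT(`price`) FROM `prices`;    -- parsed as the built-in aggregate
SELECT COUNT (`price`) FROM `prices`;   -- the form that raised ERROR 1305 in this thread

-- With IGNORE_SPACE enabled, whitespace before "(" is tolerated.
-- Caution: this statement replaces any other sql_mode flags for the session.
SET SESSION sql_mode = 'IGNORE_SPACE';
SELECT COUNT (`price`) FROM `prices`;   -- now parsed as the built-in aggregate
```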

Unreason 2010-07-22 14:01:34

Answer 2

+2 A:

This should return the same results and perform slightly faster:

SELECT
  MIN( `quantities`.`start_timestamp` ) AS `start`,
  MAX( `quantities`.`end_timestamp` ) AS `end`,
  SUM( `quantities`.`quantity` * `prices`.`price` ) 
   * COUNT(DISTINCT `quantities`.`id`) 
   / COUNT(DISTINCT `prices`.`id`)
    AS total
FROM `quantities`
JOIN `prices` ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
  AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

Since you can't calculate AVG() inside the SUM(), I had to do some interesting COUNT(DISTINCT) to calculate the number of prices returned per quantities. I'm wondering if this gives you the same results with "real" data...

Using JOIN:

+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys                 | key             | key_len | ref  | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp,end_timestamp | start_timestamp | 8       | NULL |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ALL   | timestamp,type_id             | NULL            | NULL    | NULL | 36862 |    62.20 | Using where; Using join buffer               |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+

vs. the same query only adding LEFT to the JOIN

+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys     | key             | key_len | ref   | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 |   100.00 |                                              |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+

Interesting that LEFT completely removes end_timestamp as a possible key, and changes the selected keys so much, making it take 15 times as long...

This reference page could help you out a little more if you want to look at specifying index hints for your JOINS
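
As a rough illustration of an index hint on a join, the sketch below forces the `type_id` index on `prices`. The index name is taken from the possible_keys column of the EXPLAIN output above; whether forcing it actually helps is an assumption that should be checked by comparing the plan and timings with and without the hint.

```sql
-- Inspect the plan the optimizer produces when the hint is present.
EXPLAIN
SELECT SUM(`quantities`.`quantity` * `prices`.`price`) AS `total`
FROM `quantities`
JOIN `prices` FORCE INDEX (`type_id`)   -- use USE INDEX (...) for a softer suggestion
  ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
 AND `prices`.`timestamp` <  `quantities`.`end_timestamp`
 AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
  AND `quantities`.`start_timestamp` <  '2010-07-02 00:00:00'
GROUP BY HOUR(`quantities`.`start_timestamp`);
```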

gnarf 2010-07-22 11:09:06

+1 This is good, also adding composite indexes on (start_timestamp, end_timestamp) and on (type_id, timestamp) should help. However, I think I'll be able to bring it down to ~0.01 sec
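
The two composite indexes suggested here could be created roughly as follows. The index names are made up for illustration; only the table and column names come from the thread.

```sql
-- Range lookups on quantities by start time, with end time available in the same index.
CREATE INDEX idx_quantities_start_end
    ON `quantities` (`start_timestamp`, `end_timestamp`);

-- Equality on type_id first, then a range on timestamp.
CREATE INDEX idx_prices_type_timestamp
    ON `prices` (`type_id`, `timestamp`);
```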

Unreason 2010-07-22 11:17:29

@Unreason --- *scratches head* you say +1, but no one has voted yet ;) --- `</impatience>` --- I'm interested in seeing how you get it down that far!

gnarf 2010-07-22 11:27:24

Thanks gnarf, this is almost spot on. It is running in about 0.4 secs on my machine but the results are different to my original query. The reason I think is because you are dividing by COUNT(`prices`.`price`), which with the GROUP clause and this data will be 4 quantity rows * 3 price rows = 12, but if you divide by 3 then it generates the same results as my original query. Trouble is I don't want to hard code 3 in the query, but I can't figure out what the SQL is to derive that value from the data. Once that part is sorted, it'll all be perfect. Any ideas much appreciated?

neilcrookes 2010-07-22 11:43:13

@neilcrookes - The query is now giving me the same results with the test data... But that doesn't mean it's the same calculation... In fact, if you delete only one price from that time range, the calculation returns different results from your query. This answer doesn't quite work...

gnarf 2010-07-22 12:18:04

@gnarf, using your latest query, I tested it by deleting a whole hour's worth of consecutive prices, 45 mins' worth, 30 mins' worth, 15 mins' worth and just 5 mins' worth, and I get almost the same results. The differences were: in my original I do get a row for the hour with no prices whatsoever, with a total of NULL, but I don't get a row from your query. Also, in your query the MIN and MAX start and end timestamps are ones where I still have prices. However, most totals match apart from the hour where I deleted only one of the prices - in this case there was a very small discrepancy.

neilcrookes 2010-07-22 13:06:46

neilcrookes 2010-07-22 13:13:49

@neil, ah, you can have NULLs... I just posted an answer that runs fast, but it will have to use LEFT JOINs and COALESCE to compensate for NULLs. Also, one possible problem with neil's query is that COUNT(DISTINCT ...) will fail on prices that remain the same, as those will be counted only once. I will also, time permitting, write a more general query, still aiming for high speed.

Unreason 2010-07-22 13:24:40

@gnarf, check my answer, as promised ~0.01 sec; also some notes on performance expectancy.

Unreason 2010-07-22 14:36:40

Answer 3

A:

Remember, just because you have indexes on your columns doesn't necessarily mean they'll run faster. As it stands, the index created is for each individual column, which, if you were only limiting the data on one column, would return results quite fast.

So to try and avoid "Using filesort" (which you need to do as much as possible), maybe try the following index:

CREATE INDEX start_timestamp_end_timestamp_id ON quantities (start_timestamp,end_timestamp,id);

And something similar for the prices table (combining the 3 individual indexes you have into 1 index for faster lookup)

An excellent resource that explains this in great detail, including how to optimize your indexes (and what the different EXPLAIN outputs mean, and what to aim for), is: http://hackmysql.com/case1

AcidRaZor 2010-07-22 11:55:52

Thanks AcidRaZor, but adding this index and one on the prices table didn't do much to improve performance on either my original query or the one that @gnarf has suggested.

neilcrookes 2010-07-22 12:04:19

It was worth a shot :) However, I'd still recommend reading through the website I quoted as well. They go into more detail as to how you could enhance performance with your queries

AcidRaZor 2010-07-22 12:06:32

will do, thanks

neilcrookes 2010-07-22 12:43:23

Answer 4

+4 A:

Here is my first attempt. This one is dirty and uses the following properties on data:

there are three 5-minute prices for each quarter-hour in quantities (if the data violates this, the query will not work)
note the "for each" and the cardinality of three: neither is guaranteed by data integrity checks, which is why I call it dirty
it is also not flexible to changes in periods

Query 1:

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
    sum((p1.price + p2.price + p3.price)/3*q.quantity) as total 
FROM 
    quantities q join 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 join 
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 join 
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);

This one returns results in 0.01 sec on my slow testing machine, where original query runs in ~6 sec, and gnarf's query in ~0.85 sec (all queries always tested with SQL_NO_CACHE keyword which does not reuse the results, but on a warm database).

EDIT: Query 1a: Here is a version that is not sensitive to missing rows on the price side

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
    sum( ( COALESCE(p1.price,0) + COALESCE(p2.price,0) + COALESCE(p3.price,0) ) / ( 
         3 -
         COALESCE(p1.price-p1.price,1) - 
         COALESCE(p2.price-p2.price,1) - 
         COALESCE(p3.price-p3.price,1)
        )
       *q.quantity) as total 
FROM 
    quantities q LEFT JOIN 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 LEFT JOIN
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 LEFT JOIN
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);
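
The divisor in Query 1a relies on NULL propagation: `price - price` is 0 when a price row was found and NULL when the LEFT JOIN found none, so each `COALESCE(..., 1)` term contributes 1 per missing price and the divisor becomes the number of prices actually present. The standalone sketch below shows that flag next to a more explicit equivalent; the derived table `sample` and its column `p` are made up for illustration.

```sql
SELECT
    p,
    COALESCE(p - p, 1) AS missing_flag,   -- 0 when p is a number, 1 when p is NULL
    (p IS NOT NULL)    AS present_flag    -- MySQL evaluates comparisons to 1 or 0
FROM (SELECT 10.5 AS p UNION ALL SELECT NULL) AS sample;
```

When all three prices are missing the divisor is 0, and as far as I can tell MySQL evaluates a division by zero in a SELECT to NULL, which would account for the NULL total reported for an hour with no prices at all.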

EDIT2: Query 2: Here is a direct improvement to your query, using a different approach but with minimal changes, that brings the execution time to ~0.22 sec on my machine

SELECT sql_no_cache
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
  SELECT AVG( `prices`.`price` )
  FROM `prices`
  WHERE 
    `prices`.`timestamp` >= '2010-07-01 00:00:00' 
    AND `prices`.`timestamp` < '2010-07-02 00:00:00' 
    AND `prices`.`timestamp` >= `quantities`.`start_timestamp`
    AND `prices`.`timestamp` < `quantities`.`end_timestamp`
    AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

That was on MySQL 5.1; I think I have read that in 5.5 this kind of thing (merging indexes) will be available to the query planner. Also, if you could relate start_timestamp and timestamp through a foreign key, that should allow these kinds of correlated queries to make use of indexes (but for this you would need to modify the design and establish some sort of timeline table that both quantities and prices could reference).

Query 3: Finally, the last version which does it in ~0.03 sec, but should be as robust and flexible as Query 2

SELECT sql_no_cache
MIN(start),
MAX(end),
SUM(subtotal)
FROM 
(
SELECT sql_no_cache
q.`start_timestamp` AS `start`,
q.`end_timestamp` AS `end`,
AVG(p.`price` * q.`quantity`) AS `subtotal`
FROM `quantities` q
LEFT JOIN `prices` p ON p.timestamp >= q.start_timestamp AND 
                        p.timestamp < q.end_timestamp AND
                        p.timestamp >= '2010-07-01 00:00:00' AND 
                        p.`timestamp` < '2010-07-02 00:00:00' 
WHERE q.`start_timestamp` >= '2010-07-01 00:00:00' 
AND q.`start_timestamp` < '2010-07-02 00:00:00'
AND p.type_id = 1
GROUP BY q.`start_timestamp`
) forced_tmp
GROUP BY hour( start );

NOTE: Do not forget to remove sql_no_cache keywords in production.

There are many counterintuitive tricks applied in the above queries (sometimes conditions repeated in the join condition speed up queries, sometimes they slow them down). MySQL is a great little RDBMS and really fast when it comes to relatively simple queries, but when the complexity increases it is easy to run into the above scenarios.

So in general, I apply the following principle to set my expectations regarding the performance of a query:

if the base result set has < 1,000 rows then query should do its business in ~0.01 sec (base result set is the number of rows that functionally determine resulting set)

In this particular case you start with fewer than 1000 rows (all the prices and quantities in one day, with 15-minute precision) and from that you should be able to compute the final results.

Unreason 2010-07-22 13:21:31

@Unreason, you are a legend, thanks very much. Query 2 returns perfect results in 0.0039 sec and Query 3 also returns perfect results in 0.1655 sec

neilcrookes 2010-07-22 14:27:45

@neilcrookes, you are welcome. Can you confirm that Query 2 runs faster than Query 3 on your machine? (Initially there was an unmarked Query 1a, which I have now properly marked. Also, you should allow the DB to warm up its indexes; I usually run queries with `sql_no_cache` a few times for benchmarking.)

Unreason 2010-07-22 14:33:48

@Unreason (could no longer edit first comment, so creating a new one), you are a legend, thanks very much. Query 1a returns perfect results in 0.0039 sec and Query 2 also returns perfect results in 0.1655 sec. Query 3 suffers the same issue as @gnarf's query in that it doesn't return a row where there are no prices in that hour and the start and end times correspond to the earliest and latest price records in that hour, but returns in 0.0144 sec. Query 1a is the winner. Thanks again. You're a life saver.

neilcrookes 2010-07-22 14:48:00

@neilcrookes, just for completeness' sake - I have a bug in Query 3: `LEFT JOIN` serves no purpose if the query later tests for `p.type_id = 1`, because that filters out the rows with NULLs (that's why it drops those rows). However, correcting this error by putting in `(p.type_id = 1 OR p.type_id IS NULL)` slows the query down to ~0.2 sec. Will not edit the answer.
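
A common alternative to the `OR p.type_id IS NULL` fix is to move the `type_id` test from the WHERE clause into the ON clause, so that quantity rows without a matching price survive the LEFT JOIN with NULL price columns. The sketch below applies that to the inner query of Query 3 only; it was not part of the original answer and has not been timed against this data set.

```sql
SELECT
    q.`start_timestamp` AS `start`,
    q.`end_timestamp`   AS `end`,
    AVG(p.`price` * q.`quantity`) AS `subtotal`   -- NULL when no price matched
FROM `quantities` q
LEFT JOIN `prices` p
       ON p.`timestamp` >= q.`start_timestamp`
      AND p.`timestamp` <  q.`end_timestamp`
      AND p.`type_id` = 1                         -- filter moved from WHERE into ON
WHERE q.`start_timestamp` >= '2010-07-01 00:00:00'
  AND q.`start_timestamp` <  '2010-07-02 00:00:00'
GROUP BY q.`start_timestamp`, q.`end_timestamp`;
```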

Unreason 2010-07-22 15:30:11
