Hi All,

I've got a very large table (~100 million records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.

I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:

DateRanges
range_id   range_name   range_start   range_end
1          0-90         0             90
2          91-180       91            180
3          181-365      181           365
4          366-1095     366           1095
5          1096+        1096          999999999
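
For reference, I created that table with something like this (the column types are a rough sketch from memory):

CREATE TABLE `DateRanges` (
  range_id    INT NOT NULL PRIMARY KEY,
  range_name  VARCHAR(20),
  range_start INT,   -- age lower bound, in days
  range_end   INT    -- age upper bound, in days
);

INSERT INTO `DateRanges` VALUES
  (1, '0-90',     0,    90),
  (2, '91-180',   91,   180),
  (3, '181-365',  181,  365),
  (4, '366-1095', 366,  1095),
  (5, '1096+',    1096, 999999999);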

And wrote a query that looks like this:

SELECT r.range_name,
       SUM(IF(DATEDIFF(CURDATE(), t.file_last_access) > r.range_start
          AND DATEDIFF(CURDATE(), t.file_last_access) < r.range_end, 1, 0)) AS FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name

However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.

What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....
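
The shape I have in mind is something like the sketch below, with the boundaries from DateRanges hardcoded just for illustration; what I can't see is how to drive it from the DateRanges table itself:

SELECT CASE
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 90   THEN '0-90'
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 180  THEN '91-180'
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 365  THEN '181-365'
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 1095 THEN '366-1095'
         ELSE '1096+'
       END AS range_name,        -- one bucket per row, assigned in a single pass
       COUNT(*) AS FileCount
FROM `HugeFileTable`
GROUP BY range_name;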

Can anyone help out with this?

Thanks.

EDIT: MySQL Version: 5.0.45, Tables are MyISAM

EDIT2: Here's the EXPLAIN output that was asked for in the comments

id  select_type  table  type  possible_keys  key  key_len  ref  rows      Extra  
1   SIMPLE       r      ALL   NULL           NULL NULL     NULL 5         Using temporary; Using filesort 
1   SIMPLE       t      ALL   NULL           NULL NULL     NULL 96506321
+1  A: 

Well, start by making sure that file_last_access is indexed in HugeFileTable.

I'm not sure if this is possible or better, but try computing the date limits first (files from date A to date B), then use a query with >= and <=. That should, at least in theory, improve performance.

The comparison would be something like:

 t.file_last_access >= StartDate AND t.file_last_access <= EndDate
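
For instance, for the 91-180 day bucket, the precomputed limits would be something like this (session variables used just for illustration):

SET @StartDate = CURDATE() - INTERVAL 180 DAY;  -- oldest access date in the bucket
SET @EndDate   = CURDATE() - INTERVAL 91 DAY;   -- newest access date in the bucket

SELECT COUNT(*) AS FileCount
FROM `HugeFileTable` t
WHERE t.file_last_access >= @StartDate
  AND t.file_last_access <= @EndDate;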
Aziz
Thanks for your response. I don't think this will improve performance very much; it takes out one comparison (in days), but that's not really where the slowdown is coming from. Also, I can't make that column an index, but I don't see how it would help anyway.
Zenshai
@zenshai Using an index (B-tree) speeds up the query by allowing MySQL to essentially bail out early on any file_last_access values that are outside your desired range. Without the index you're doing a table scan, which is O(N) comparisons, where N is the number of rows in the table. With an index you do O(M), where M is the number of matching rows and M <= N.
mattkemp
@mattkemp: Ah, I see what you're saying; I was thinking in terms of my original query, not the modification Aziz suggested. The problem is that creating an index on these huge tables takes a long time and eats a lot of space, and I have to run the same query not just on last access date but also on modified date and creation date, so I'd need indexes on those too.
Zenshai
A: 

You could get a small improvement by removing CURDATE() and putting a literal date in the query, since the SQL otherwise runs this function twice for each row.
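
For illustration, with a hypothetical run date of '2009-06-01' substituted for CURDATE():

SELECT r.range_name,
       SUM(IF(DATEDIFF('2009-06-01', t.file_last_access) > r.range_start
          AND DATEDIFF('2009-06-01', t.file_last_access) < r.range_end, 1, 0)) AS FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name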

ontangent
+2  A: 

First, create an index on HugeFileTable.file_last_access.
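
Something along these lines (the index name is arbitrary):

ALTER TABLE `HugeFileTable` ADD INDEX `idx_file_last_access` (`file_last_access`);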

Then try the following query:

SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
 JOIN `HugeFileTable` t 
 ON (t.file_last_access BETWEEN 
   CURDATE() - INTERVAL r.range_end DAY AND 
   CURDATE() - INTERVAL r.range_start DAY)
GROUP BY r.range_name;

Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):

+-------+-------+------------------+----------------------------------------------+
| table | type  | key              | Extra                                        |
+-------+-------+------------------+----------------------------------------------+
| t     | index | file_last_access | Using index; Using temporary; Using filesort | 
| r     | ALL   | NULL             | Using where                                  | 
+-------+-------+------------------+----------------------------------------------+

It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.

But at least this query eliminates the Cartesian product that you had in your original query.


Update: Here's another query that uses a correlated subquery but eliminates the GROUP BY.

SELECT r.range_name,
  (SELECT COUNT(*) 
   FROM `HugeFileTable` t 
   WHERE t.file_last_access BETWEEN 
     CURDATE() - INTERVAL r.range_end DAY AND 
     CURDATE() - INTERVAL r.range_start DAY
  ) as FileCount
FROM `DateRanges` r;

The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):

+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type        | table | type  | key              | Extra                    |
+----+--------------------+-------+-------+------------------+--------------------------+
|  1 | PRIMARY            | r     | ALL   | NULL             |                          | 
|  2 | DEPENDENT SUBQUERY | t     | index | file_last_access | Using where; Using index | 
+----+--------------------+-------+-------+------------------+--------------------------+

Try this query on your data set and see if it performs better.

Bill Karwin
Thanks, I'll try that and update in the comments.
Zenshai
So I ran your query without an index (I really didn't want to make one if I didn't have to) and it completed in 9182 seconds (~2.5 hrs), which is actually quite acceptable, since I'll be running it overnight about once a week. So, thank you very much for your help.
Zenshai
Ok, as long as it's acceptable, that's your call. I'd suggest trying it with the index for comparison. If it's not significantly faster, then you can always drop the index after that test.
Bill Karwin