Hi All,

I've got a very large table (~100 million records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.

I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:

DateRanges
range_id   range_name   range_start   range_end
1          0-90         0             90
2          91-180       91            180
3          181-365      181           365
4          366-1095     366           1095
5          1096+        1096          999999999
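
For reference, I created that table with something like this (the column types are a rough sketch from memory):

CREATE TABLE `DateRanges` (
  range_id    INT NOT NULL PRIMARY KEY,
  range_name  VARCHAR(20),
  range_start INT,   -- age lower bound, in days
  range_end   INT    -- age upper bound, in days
);

INSERT INTO `DateRanges` VALUES
  (1, '0-90',     0,    90),
  (2, '91-180',   91,   180),
  (3, '181-365',  181,  365),
  (4, '366-1095', 366,  1095),
  (5, '1096+',    1096, 999999999);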

And wrote a query that looks like this:

SELECT r.range_name,
       SUM(IF(DATEDIFF(CURDATE(), t.file_last_access) > r.range_start
          AND DATEDIFF(CURDATE(), t.file_last_access) < r.range_end, 1, 0)) AS FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name

However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.

What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....
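
The shape I have in mind is something like the sketch below, with the boundaries from DateRanges hardcoded just for illustration; what I can't see is how to drive it from the DateRanges table itself:

SELECT CASE
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 90   THEN '0-90'
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 180  THEN '91-180'
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 365  THEN '181-365'
         WHEN DATEDIFF(CURDATE(), file_last_access) <= 1095 THEN '366-1095'
         ELSE '1096+'
       END AS range_name,        -- one bucket per row, assigned in a single pass
       COUNT(*) AS FileCount
FROM `HugeFileTable`
GROUP BY range_name;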

Can anyone help out with this?

Thanks.

EDIT: MySQL Version: 5.0.45, Tables are MyISAM

EDIT2: Here's the EXPLAIN output that was asked for in the comments

id  select_type  table  type  possible_keys  key  key_len  ref  rows      Extra  
1   SIMPLE       r      ALL   NULL           NULL NULL     NULL 5         Using temporary; Using filesort 
1   SIMPLE       t      ALL   NULL           NULL NULL     NULL 96506321
+1  A: 

Well, start by making sure that file_last_access is indexed in HugeFileTable.

I'm not sure if this is possible or better, but try computing the date limits first (files from date A to date B), then use a query with >= and <=. That should, at least in theory, improve performance.

The comparison would be something like:

 t.file_last_access >= StartDate AND t.file_last_access <= EndDate
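
For instance, for the 91-180 day bucket, the precomputed limits would be something like this (session variables used just for illustration):

SET @StartDate = CURDATE() - INTERVAL 180 DAY;  -- oldest access date in the bucket
SET @EndDate   = CURDATE() - INTERVAL 91 DAY;   -- newest access date in the bucket

SELECT COUNT(*) AS FileCount
FROM `HugeFileTable` t
WHERE t.file_last_access >= @StartDate
  AND t.file_last_access <= @EndDate;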
Aziz
Thanks for your response. I don't think this will improve performance very much; it takes out one comparison (in days), but that's not really where the slowdown is coming from. Also, I can't make that column an index, but I don't see how it would help anyway.
Zenshai
@zenshai Using an index (B-tree) speeds up the query by allowing MySQL to essentially bail out early on any file_last_access values that are outside your desired range. Without the index you're doing a table scan, which is O(N) comparisons, where N is the number of rows in the table. With an index you do O(M), where M is the number of matching rows and M <= N.
mattkemp
@mattkemp: Ah, I see what you're saying; I was thinking in terms of my original query, not the modification Aziz suggested. The problem is that creating an index on these huge tables takes a long time and eats a lot of space, and I have to run the same query not just on last access date but also on modified date and creation date, so I'd need indexes on those too.
Zenshai
A: 

You could get a small improvement by removing CURDATE() and putting a literal date in the query, since the SQL otherwise runs this function twice for each row.
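
For illustration, with a hypothetical run date of '2009-06-01' substituted for CURDATE():

SELECT r.range_name,
       SUM(IF(DATEDIFF('2009-06-01', t.file_last_access) > r.range_start
          AND DATEDIFF('2009-06-01', t.file_last_access) < r.range_end, 1, 0)) AS FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name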

ontangent
+2  A: 

First, create an index on HugeFileTable.file_last_access.
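
Something along these lines (the index name is arbitrary):

ALTER TABLE `HugeFileTable` ADD INDEX `idx_file_last_access` (`file_last_access`);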

Then try the following query:

SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
 JOIN `HugeFileTable` t 
 ON (t.file_last_access BETWEEN 
   CURDATE() - INTERVAL r.range_end DAY AND 
   CURDATE() - INTERVAL r.range_start DAY)
GROUP BY r.range_name;

Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):

+-------+-------+------------------+----------------------------------------------+
| table | type  | key              | Extra                                        |
+-------+-------+------------------+----------------------------------------------+
| t     | index | file_last_access | Using index; Using temporary; Using filesort | 
| r     | ALL   | NULL             | Using where                                  | 
+-------+-------+------------------+----------------------------------------------+

It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.

But at least this query eliminates the Cartesian product that you had in your original query.


Update: Here's another query that uses a correlated subquery but eliminates the GROUP BY.

SELECT r.range_name,
  (SELECT COUNT(*) 
   FROM `HugeFileTable` t 
   WHERE t.file_last_access BETWEEN 
     CURDATE() - INTERVAL r.range_end DAY AND 
     CURDATE() - INTERVAL r.range_start DAY
  ) as FileCount
FROM `DateRanges` r;

The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):

+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type        | table | type  | key              | Extra                    |
+----+--------------------+-------+-------+------------------+--------------------------+
|  1 | PRIMARY            | r     | ALL   | NULL             |                          | 
|  2 | DEPENDENT SUBQUERY | t     | index | file_last_access | Using where; Using index | 
+----+--------------------+-------+-------+------------------+--------------------------+

Try this query on your data set and see if it performs better.

Bill Karwin
Thanks, I'll try that and update in the comments.
Zenshai
So I ran your query without an index (I really didn't want to make one if I didn't have to) and it completed in 9182 seconds (~2.5 hrs), which is actually quite acceptable, since I'll be running it overnight about once a week. So, thank you very much for your help.
Zenshai
Ok, as long as it's acceptable, that's your call. I'd suggest trying it with the index for comparison. If it's not significantly faster, then you can always drop the index after that test.
Bill Karwin