Hello, I'm troubleshooting a query performance problem. Here's an expected query plan from explain:

mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:16';
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table              | type  | possible_keys | key          | key_len | ref  | rows    | Extra       |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
|  1 | SIMPLE      | table1             | range | tdcol         | tdcol        | 8       | NULL | 5437848 | Using where | 
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
1 row in set (0.00 sec)

That makes sense, since the index named tdcol (KEY tdcol (tdcol)) is used, and about 5M rows should be selected from this query.

However, if I query for just one more minute of data, I get this query plan:

mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:17';
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table              | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
|  1 | SIMPLE      | table1             | ALL  | tdcol         | NULL | NULL    | NULL | 381601300 | Using where | 
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)

The optimizer believes that the scan will be better, but it's over 70x more rows to examine, so I have a hard time believing that the table scan is better.

Also, the 'USE KEY tdcol' syntax does not change the query plan.
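(For reference, USE KEY is a synonym for USE INDEX, which is only a hint the optimizer is free to ignore. I have not yet tried FORCE INDEX, which is supposed to make the optimizer treat a table scan as a last resort. A sketch of both, using the same query as above:)

```sql
-- USE INDEX: a hint; the optimizer may still choose a table scan
SELECT * FROM table1 USE INDEX (tdcol)
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';

-- FORCE INDEX: a table scan is assumed very expensive and is used
-- only if the named index cannot be used at all
SELECT * FROM table1 FORCE INDEX (tdcol)
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';
```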

Thanks in advance for any help, and I'm more than happy to provide more info/answer questions.

A: 

MySQL's query optimizer has a cutoff when figuring out whether to use an index. As you've correctly identified, MySQL has decided a table scan will be faster than using the index, and it won't be dissuaded from its decision. The irony is that when the key range matches more than about a third of the table, it is probably right. So why in this case?

I don't have an answer, but I suspect MySQL doesn't have enough memory to explore the index. I would look at the server memory settings, particularly the InnoDB buffer pool and some of the other key storage pools.
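A quick way to inspect the relevant settings (these are standard MySQL/InnoDB variable names; which ones matter depends on your storage engine):

```sql
-- How much memory InnoDB's buffer pool is allowed to use
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- MyISAM key cache, in case the table isn't InnoDB
SHOW VARIABLES LIKE 'key_buffer_size';
```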

staticsan
Thanks staticsan. What you're saying makes sense, and in fact this box has a paucity of memory compared to the table size. Interestingly, however, I reconfigured the buffer pool up to 27GB of memory from 5.5GB of memory with no other changes and it did not change the query plans at all.
andysk
A: 

What's the distribution of your data like? Try running MIN(), AVG(), and MAX() on it to see where it sits. It's possible that that one extra minute makes the difference in how much data falls inside the range.
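For example (a sketch, using the table and column names from the question):

```sql
-- Rough distribution check for the indexed column
SELECT MIN(tdcol), AVG(tdcol), MAX(tdcol), COUNT(*) FROM table1;

-- How many rows each upper bound actually covers
SELECT COUNT(*) FROM table1 WHERE tdcol <= '2010-04-14 03:16';
SELECT COUNT(*) FROM table1 WHERE tdcol <= '2010-04-14 03:17';
```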

It could also just be InnoDB's background configuration. There are a few factors, like page size and memory, as staticsan said. You may also want to explicitly define a B+Tree index.

Crewe
This is a datetime field that I am trying to select on that spans more than a year. Recent days average about 5M records. It's pretty evenly distributed. In this case, that one minute is 1200 records, which is not significant. Agreed that my settings are important - my settings pretty much conform to this (http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/), but I will have to check into page size and explicitly defining a B+ Tree index.
andysk
A: 

5 million index probes could well be more expensive (lots of random disk reads, potentially more complicated synchronization) than reading all 350 million rows (sequential disk reads).

This case might be an exception, because presumably the order of the timestamps roughly matches the order of the inserts into the table. But unless the index on tdcol is a "clustered" index (meaning that the database ensures that the order in the underlying table matches the order in tdcol), it's unlikely that the optimizer knows this.

In the absence of that order-correlation information, it would be right to assume that the 5 million rows you want are roughly evenly distributed among the 350 million rows, and thus that the index approach would involve reading most or nearly all of the pages of the underlying table anyway (in which case the scan will be much less expensive than the index approach: fewer reads outright, and sequential instead of random reads).
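A back-of-the-envelope version of that argument, using the row counts from the two EXPLAIN outputs above (the rows-per-page figure is an assumption, not something from the question):

```sql
-- About 70 table rows per matching row. If an InnoDB page holds on the
-- order of 70+ rows, then under an even distribution nearly every page
-- contains at least one match, so the index path would touch almost
-- every page anyway -- in random order rather than sequentially.
SELECT 381601300 / 5437848 AS rows_per_match;  -- roughly 70
```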

Doug McClean
Thank you Doug. Your assumptions are true - this datetime column `tdcol` is not in a clustered index, simply a plain index - because unfortunately the value is not unique, we have multiple records per second. And yes, the index ordering matches closely with the order of inserts/layout of data on disk, but the optimizer has no way to know this. And you make an interesting point in terms of pages - if the indexed rows are distributed randomly and there are at least 70 rows per page, that alone would justify the scan.
andysk
MySQL requires clustered indices to be unique? Why?
Doug McClean
That's the way I read this: http://dev.mysql.com/doc/refman/5.1/en/innodb-index-types.html I do not see a reason why a clustered index should have to be unique.
andysk
Strange. Maybe there's some connection to multi-version concurrency control or MySQL/InnoDB's implementation thereof?
Doug McClean
Perhaps - like identical serialization would not be guaranteed with a naive implementation using a non-unique index? If someone can prove me wrong about clustered indexes needing to be unique in MySQL, I'd sure be pleased!
andysk
A: 

"so I have a hard time believing that the table scan is better."

True. YOU have a hard time believing it. But the optimizer seems not to.

I won't pronounce on your being "right" versus your optimizer being "right". But optimizers do as they do, and, all in all, their "intellectual" capacity must still be considered as being fairly limited.

That said, do your database statistics show a MAX value (for this column) that happens to be equal to the "one minute more" value?

If so, then the optimizer might have concluded that all rows satisfy the upper limit anyway, and might have decided to proceed differently, compared to the case when it has to conclude, "oh, there are definitely some rows that won't satisfy the upper limit, so I'll use the index just to be on the safe side".
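A quick way to check (table and column names taken from the question; ANALYZE TABLE refreshes the statistics the optimizer bases its row estimates on):

```sql
-- The actual maximum value stored in the column
SELECT MAX(tdcol) FROM table1;

-- Rebuild the index statistics used for row estimates
ANALYZE TABLE table1;
```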

Erwin Smout
Thanks for the feedback, Erwin. How would I look at the MAX in the statistics, since it's not in the table status?
andysk