Hello, I'm troubleshooting a query performance problem. Here's an expected query plan from explain:

mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:16';
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table              | type  | possible_keys | key          | key_len | ref  | rows    | Extra       |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
|  1 | SIMPLE      | table1             | range | tdcol         | tdcol        | 8       | NULL | 5437848 | Using where | 
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
1 row in set (0.00 sec)

That makes sense, since the index named tdcol (KEY tdcol (tdcol)) is used, and about 5M rows should be selected from this query.

However, if I query for just one more minute of data, I get this query plan:

mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:17';
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table              | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
|  1 | SIMPLE      | table1             | ALL  | tdcol         | NULL | NULL    | NULL | 381601300 | Using where | 
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)

The optimizer believes that the scan will be better, but it's over 70x more rows to examine, so I have a hard time believing that the table scan is better.

Also, the 'USE KEY tdcol' syntax does not change the query plan.
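(For reference, USE KEY is a synonym for USE INDEX, which is only a hint the optimizer is free to ignore. I have not yet tried FORCE INDEX, which is supposed to make the optimizer treat a table scan as a last resort. A sketch of both, using the same query as above:)

```sql
-- USE INDEX: a hint; the optimizer may still choose a table scan
SELECT * FROM table1 USE INDEX (tdcol)
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';

-- FORCE INDEX: a table scan is assumed very expensive and is used
-- only if the named index cannot be used at all
SELECT * FROM table1 FORCE INDEX (tdcol)
WHERE tdcol BETWEEN '2010-04-13 00:00' AND '2010-04-14 03:17';
```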

Thanks in advance for any help, and I'm more than happy to provide more info/answer questions.

A: 

MySQL's query optimizer has a cutoff when figuring out whether to use an index. As you've correctly identified, MySQL has decided a table scan will be faster than using the index, and it won't be dissuaded from its decision. The irony is that when the key range matches more than about a third of the table, it is probably right. So why in this case?

I don't have an answer, but I suspect MySQL doesn't have enough memory to explore the index. I would look at the server memory settings, particularly the InnoDB buffer pool and some of the other key storage pools.
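A quick way to inspect the relevant settings (these are standard MySQL/InnoDB variable names; which ones matter depends on your storage engine):

```sql
-- How much memory InnoDB's buffer pool is allowed to use
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- MyISAM key cache, in case the table isn't InnoDB
SHOW VARIABLES LIKE 'key_buffer_size';
```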

staticsan
Thanks staticsan. What you're saying makes sense, and in fact this box has a paucity of memory compared to the table size. Interestingly, however, I reconfigured the buffer pool up to 27GB of memory from 5.5GB of memory with no other changes and it did not change the query plans at all.
andysk
A: 

What's the distribution of your data like? Try running MIN(), AVG(), and MAX() on it to see where it sits. It's possible that that one extra minute makes the difference in how much data falls inside the range.
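For example (a sketch, using the table and column names from the question):

```sql
-- Rough distribution check for the indexed column
SELECT MIN(tdcol), AVG(tdcol), MAX(tdcol), COUNT(*) FROM table1;

-- How many rows each upper bound actually covers
SELECT COUNT(*) FROM table1 WHERE tdcol <= '2010-04-14 03:16';
SELECT COUNT(*) FROM table1 WHERE tdcol <= '2010-04-14 03:17';
```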

It could also just be InnoDB's background configuration. There are a few factors, like page size and memory, as staticsan said. You may also want to explicitly define a B+Tree index.

Crewe
This is a datetime field that I am trying to select on that spans more than a year. Recent days average about 5M records. It's pretty evenly distributed. In this case, that one minute is 1200 records, which is not significant. Agreed that my settings are important - my settings pretty much conform to this (http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/), but I will have to check into page size and explicitly defining a B+ Tree index.
andysk
A: 

5 million index probes could well be more expensive (lots of random disk reads, potentially more complicated synchronization) than reading all 350 million rows (sequential disk reads).

This case might be an exception, because presumably the order of the timestamps roughly matches the order of the inserts into the table. But unless the index on tdcol is a "clustered" index (meaning that the database ensures that the order in the underlying table matches the order in tdcol), it's unlikely that the optimizer knows this.

In the absence of that order-correlation information, it would be right to assume that the 5 million rows you want are roughly evenly distributed among the 350 million rows, and thus that the index approach would involve reading most or nearly all of the pages of the underlying table anyway (in which case the scan will be much less expensive than the index approach: fewer reads outright, and sequential instead of random reads).
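A back-of-the-envelope version of that argument, using the row counts from the two EXPLAIN outputs above (the rows-per-page figure is an assumption, not something from the question):

```sql
-- About 70 table rows per matching row. If an InnoDB page holds on the
-- order of 70+ rows, then under an even distribution nearly every page
-- contains at least one match, so the index path would touch almost
-- every page anyway -- in random order rather than sequentially.
SELECT 381601300 / 5437848 AS rows_per_match;  -- roughly 70
```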

Doug McClean
Thank you Doug. Your assumptions are true - this datetime column `tdcol` is not in a clustered index, simply a plain index - because unfortunately the value is not unique, we have multiple records per second. And yes, the index ordering matches closely with the order of inserts/layout of data on disk, but the optimizer has no way to know this. And you make an interesting point in terms of pages - if the indexed rows are distributed randomly and there are at least 70 rows per page, that alone would justify the scan.
andysk
MySQL requires clustered indices to be unique? Why?
Doug McClean
That's the way I read this: http://dev.mysql.com/doc/refman/5.1/en/innodb-index-types.html I do not see a reason why a clustered index should have to be unique.
andysk
Strange. Maybe there's some connection to multi-version concurrency control or MySQL/InnoDB's implementation thereof?
Doug McClean
Perhaps - like identical serialization would not be guaranteed with a naive implementation using a non-unique index? If someone can prove me wrong about clustered indexes needing to be unique in MySQL, I'd sure be pleased!
andysk
A: 

"so I have a hard time believing that the table scan is better."

True. YOU have a hard time believing it. But the optimizer seems not to.

I won't pronounce on your being "right" versus your optimizer being "right". But optimizers do as they do, and, all in all, their "intellectual" capacity must still be considered as being fairly limited.

That said, do your database statistics show a MAX value (for this column) that happens to be equal to the "one minute more" value?

If so, then the optimizer might have concluded that all rows satisfy the upper limit anyway, and might have decided to proceed differently, compared to the case when it has to conclude, "oh, there are definitely some rows that won't satisfy the upper limit, so I'll use the index just to be on the safe side".
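A quick way to check (table and column names taken from the question; ANALYZE TABLE refreshes the statistics the optimizer bases its row estimates on):

```sql
-- The actual maximum value stored in the column
SELECT MAX(tdcol) FROM table1;

-- Rebuild the index statistics used for row estimates
ANALYZE TABLE table1;
```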

Erwin Smout
Thanks for the feedback, Erwin. How would I look at the MAX in the statistics, since it's not in the table status?
andysk