I have a warehouse table that looks like this:

CREATE TABLE Warehouse (
  id BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
  eventId BIGINT(20) UNSIGNED NOT NULL,
  groupId BIGINT(20) NOT NULL,
  activityId BIGINT(20) UNSIGNED NOT NULL,
  -- ... many more id columns
  txtProperty1 VARCHAR(255),
  txtProperty2 VARCHAR(255),
  txtProperty3 VARCHAR(255),
  txtProperty4 VARCHAR(255),
  txtProperty5 VARCHAR(255),
  -- ... many more of these
  PRIMARY KEY (id),
  KEY WInvestmentDetail_idx01 (groupId)
  -- ... several more indices
) ENGINE=INNODB;

Now, the following query spends about 0.8s in query time and 0.2s in fetch time, for a total of about one second. The query returns ~67,000 rows.

SELECT eventId
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
  AND scenarioId IS NULL
  AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;

Adding more ids to the select clause doesn't really change the performance at all.

SELECT eventId, groupId, activityId, insertDate
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
  AND scenarioId IS NULL
  AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;

However, adding a "property" column does change it: 1.8s query time and 0.6s fetch time.

SELECT eventId, txtProperty1
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
  AND scenarioId IS NULL
  AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;

Now to really blow your socks off. Instead of txtProperty1, using txtProperty2 changes the times to 24s query, 0.8s fetch!

SELECT eventId, txtProperty2
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
  AND scenarioId IS NULL
  AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;

The two columns are pretty much identical in the kind of data they hold: mostly non-null, and neither is indexed (not that that should make a difference anyway). To be sure the table itself is healthy, I ran ANALYZE/OPTIMIZE against it.
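
That is (give or take the exact options):

ANALYZE TABLE Warehouse;
OPTIMIZE TABLE Warehouse;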

This is really mystifying to me. I can see how adding columns to the select clause might slightly increase fetch time, but it should not change query time, certainly not this significantly. I would appreciate any ideas as to what is causing this slowdown.

EDIT - More data points

SELECT * actually outperforms txtProperty2 - 0.8s query, 8.4s fetch. Too bad I can't use it, because the fetch time is (understandably) too long.

A: 

I'll admit that this is a bit of a guess, but I'll give it a shot.

You have id -- the first field -- as the primary key. I'm not 100% sure how MySQL handles lookups against a clustered index, but it is reasonable to suspect that, for any given ID, there is some "pointer" to the record with that ID.

It is relatively easy to find the beginnings of fields when all prior fields have fixed width. All your BIGINT(20) fields have a defined size that makes it easy for the db engine to find the field given a pointer to the start of the record; it's a simple calculation. Likewise, the start of the first VARCHAR(255) field is easy to find. After that, though, because the fields are VARCHAR fields, the db engine must take the data into account to find the start of the next field, which is much slower than simply calculating where that field should be. So, for any fields after txtProperty1, you will have this issue.

What would happen if you changed all the VARCHAR(255) fields to CHAR(255) fields? It is very possible that your query will be much faster, albeit at the cost of using the maximum storage for each CHAR(255) field regardless of the data it actually contains.
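
If you want to experiment, something like this (untested, and note that it rebuilds the table) would convert the first two columns:

ALTER TABLE Warehouse
  MODIFY txtProperty1 CHAR(255),  -- CHAR always occupies the full 255 bytes
  MODIFY txtProperty2 CHAR(255);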

Andrew
Sorry, no dice. Changing the first 5 properties to CHAR(255) actually made the query run in 98s query, 1.5s fetch. However, testing in this area has led me to another odd discovery: selecting txtProperty8 incurs only the same small penalty as txtProperty1 (about 2 seconds), while txtProperty7 is somewhere in between (around 5 seconds). This whole thing is very, very strange.
Monkey Boson
A: 

Fragmented tablespace? Try a null alter table:

ALTER TABLE tbl_name ENGINE=INNODB;
igelkott
Sorry, it didn't work. Does this accomplish the same thing as OPTIMIZE in InnoDB?
Monkey Boson
I don't think so. Still think there might be some sort of tablespace error to explain the dramatic differences between nearly identical columns.
igelkott
A: 

The MySQL documentation for the InnoDB engine suggests that if your varchar data doesn't fit on the page (i.e. a node of the B-tree structure), the information will be stored on overflow pages. So on your wide Warehouse table, it may be that txtProperty1 is stored on-page and txtProperty2 is off-page, thus requiring additional I/O to retrieve.
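
If it helps, you can check which row format the table is using (Row_format in the output is the relevant column; the table name here matches yours):

SHOW TABLE STATUS LIKE 'Warehouse';

-- With the REDUNDANT/COMPACT row formats, InnoDB keeps a 768-byte prefix of an
-- overflowing column on the page; with DYNAMIC/COMPRESSED, the whole value can
-- be moved off-page, leaving only a 20-byte pointer behind.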

Not too sure why SELECT * does better; it may be able to take advantage of reading the data sequentially, rather than picking its way around the disk.

richaux
This scenario is entirely possible given my data. I'm a little surprised at the 2s -> 24s increase in retrieval time, though. Any ideas on how I can ameliorate the query time?
Monkey Boson
I don't have any practical experience of this: there appear to be two potential ways of getting more data on-page. a) You could try changing the page size by setting KEY_BLOCK_SIZE, or b) do you have any flexibility around the datatype sizes, e.g. do the numerics need to be BIGINT (would an UNSIGNED INT or MEDIUMINT do?), and/or can the VARCHARs be just VARCHAR(100)?
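A hypothetical sketch of b), assuming the stored values actually fit the smaller types (the columns picked here are just examples):
ALTER TABLE Warehouse
  MODIFY accountId INT UNSIGNED,     -- was BIGINT: 4 bytes per value instead of 8
  MODIFY txtProperty1 VARCHAR(100);  -- assumes no stored value exceeds 100 characters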
richaux
Looks like `SHOW TABLE STATUS` will show you the current `KEY_BLOCK_SIZE`. What is that value? And what do the column sizes up to txtProperty1 add up to?
Harold L
A: 

Since I am a SQL Server user and not a MySQL guy, this is a long shot. In SQL Server the clustered index is the table. All the table data is stored in the clustered index. Additional indexes store redundant copies of the indexed data sorted in the appropriate sort order.

My reasoning is this. As you add more and more data to the query, the fetch time remains negligible. I presume this is because you are fetching all the data from the clustered index during the query phase and there is effectively nothing left to do during the fetch phase.

The reason SELECT * works the way it does is that your table is so wide. As long as you are just requesting the key and one or two additional columns, it is best to just get everything during the query. Once you ask for everything, it becomes cheaper to split the fetching between the two phases. I am guessing that if you add columns to your query one at a time, you will discover the boundary where the query analyzer switches from doing all of the fetching in the query phase to doing most of it in the fetch phase.
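
For example, reusing the filter from the question (time each variant and watch how the query/fetch split moves):

SELECT eventId, txtProperty1
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
  AND scenarioId IS NULL
  AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;

-- then eventId, txtProperty1, txtProperty2; then three properties; and so on,
-- adding one column per run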

J Edward Ellis
This sounds like the "covering index" technique mentioned by a few others. Is it still the case if neither txtProperty1 nor txtProperty2 is part of any index?
Monkey Boson
A: 

You should post the explain plans of the two queries so we can see what they are.

My guess is that the fast one is using a "Covering index", and the slow one isn't.

This means that the slow one must do 67,000 primary key lookups, which will be very inefficient if the table isn't all in memory (typically requiring 67k I/O operations if the table is arbitrarily large and each row is in its own page).

In MySQL, EXPLAIN will show "Using index" if a covering index is being used.
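
For example, with the first query from the question (the index below is hypothetical; a covering index just has to include every column the query touches):

EXPLAIN
SELECT eventId
FROM Warehouse
WHERE accountId IN (10, 8, 13, 9, 7, 6, 12, 11)
  AND scenarioId IS NULL
  AND insertDate BETWEEN DATE '2002-01-01' AND DATE '2011-12-31'
ORDER BY insertDate;

-- one possible covering index for that query:
CREATE INDEX Warehouse_covering_idx
  ON Warehouse (scenarioId, accountId, insertDate, eventId);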

MarkR
The EXPLAIN output is identical in both cases. Even though the items in the where clause are indexed, MySQL decides in both cases to perform a full table scan (probably because 67,000 rows represents a significant fraction of the entire table). In the last two queries I mentioned, neither can use the "covering index" technique because both contain columns that are not indexed.
Monkey Boson