views:

122

answers:

7

This query selects all the unique visitor sessions in a certain date range:

select distinct(accessid) from accesslog where date > '2009-09-01'

I have indexes on the following fields:

  • accessid
  • date
  • some other fields

Here's what explain looks like:

mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table     | type  | possible_keys        | key  | key_len | ref  | rows  | Extra                        |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
|  1 | SIMPLE      | accesslog | range | date,dateurl,dateaff | date | 3       | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+


mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table     | type  | possible_keys | key      | key_len | ref  | rows    | Extra       |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
|  1 | SIMPLE      | accesslog | index | NULL          | accessid | 257     | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+

Why doesn't the query with the date clause use the accessid index?

Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?

Edit - Resolution

Reducing column width on accessid from varchar 255 to char 32 improved query time by ~75%.

Adding a date+accessid index had no effect on query time.

A: 

no, you have all indices there that you need:

  • on your id
  • on your date
knittl
A: 

Why doesn't the query with the date clause not use the accessid index?

Because using the date index is more efficient. That's because it's likely to pare the search space down faster.

At least one DBMS (DB2/z, I don't know much about MySQL) would benefit from an index on date+accessid since the access IDs would be sorted within dates in that index. That DBMS will use the date+accessid key to efficiently use the where clause to whittle down the search space and to return distinct values of accessid within that space.

Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).

paxdiablo
A: 

The query uses the 'date' index because thats what you use in the where clause.

This is the only sensible option, if it used the access id index it would need to read all the accessid rows then check the date before it and only then decide if it was distinct.

If this is a really big table a compound index on date and accessid might help.

James Anderson
A: 

I have no way of testing it, but I would definitely try to add an index spanning both accessid and date.

Index optimizations if often like alchemy. Different DBMS behave differently, and sometimes you need to simply try (and fail) various combinations. I’m not saying it’s not possible to reason. It is in many cases, but up to a certain point. Often it’s simply faster and easier to follow your instinct.

Jan Zich
+2  A: 

Your problem is that your condition is a range clause (on the date column).

A multi-column index of date->accessid likely wont help the situation as MySQL can't use index columns after a range condition. In theory they should be able to use it to cover the computation in this case, but it appears to be a shortcoming in MySQL, I've never gotten it to use a multi column index in this situation successfully.

You can try creating an index on (date,accessid) hoping that it will use it to cover the query (so you won't need to hit any tables), but I don't hold much hope. There's not a great deal you can do.

Edit:

My answer is courtesy of High Performance MySQL - Second Edition, worth it's weight in gold if you have to do serious MySQL development.

Michael
Yep, adding the date+accessid index had no effect. Bummer.
Ian
A: 

Why doesn't the query with the date clause not use the accessid index?

Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.

If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table - in fact, it would do better in the context to ignore the index, but I started of with "if it used the accessid index".

Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?

Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid - the information is in the index. So, this might convert a query that access an index and a table into one that only accesses the index - which will reduce the amount of I/O needed and therefore improve the performance of the query.

If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.

Jonathan Leffler
+4  A: 

An index on (date,accessid) could help. However, before tweaking indices I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(256) for accessid? If so, can't you use a more compact type? If it's a number, it should by INT (SMALLINT, BIGINT, whatever fits your needs) and if it's an alphanumeric ID, can it really be 256 chars long? If its length is fixed, can't you use CHAR (CHAR(32) for example) instead?

Josh Davis
Ah, good call. The accessid we use is a 32 character string, for example 94a1d523fac45e589e7d8884332fa9b3. I'll reduce the size of this field.
Ian
Making this change improved performance by about 75%.
Ian
Glad it worked for you ;)
Josh Davis