views:

222

answers:

5

I need help with indexing in MySQL. I have a table with the following columns:

ID Store_ID Feature_ID Order_ID Viewed_Date Deal_ID IsTrial
ID is auto-generated. Store_ID ranges from 1 to 8, and Feature_ID from 1 to roughly 100. Viewed_Date is the date and time at which the row is inserted. IsTrial is either 0 or 1.
You can ignore Order_ID and Deal_ID from this discussion.

There are millions of rows in the table, and we have a reporting backend that needs the number of views, either within a certain period or overall, where IsTrial is 0 for a particular store and a particular feature.

The query takes the form of:

select count(viewed_date)
from theTable
where viewed_date between '2009-12-01' and '2010-12-31'
and store_id = '2'
and feature_id = '12'
and istrial = 0

In SQL Server you can have a filtered index to use for IsTrial. Is there anything similar to this in MySQL? Also, Store_ID and Feature_ID contain a lot of duplicate values. I created a composite index on Store_ID and Feature_ID. Although this seems to have decreased the search time, I need a bigger improvement than this. Right now I have more than 4 million rows; to answer a query like the one above, MySQL examines 3.5 million rows to produce a count of 500k.

PS. I forgot to add the viewed_date filter in the query. I have now added it.
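For reference, the setup described above might look like this (the index name is an assumption):

```sql
-- Existing two-column index (name assumed):
CREATE INDEX idx_store_feature ON theTable (Store_ID, Feature_ID);

-- EXPLAIN shows how many rows MySQL expects to examine for the report query:
EXPLAIN
SELECT COUNT(Viewed_Date)
FROM theTable
WHERE Viewed_Date BETWEEN '2009-12-01' AND '2010-12-31'
  AND Store_ID = 2
  AND Feature_ID = 12
  AND IsTrial = 0;
```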

A: 

If you need to optimize this query specifically in MySQL, why not add istrial to the end of the existing index on Store_ID and Feature_ID? This will completely cover the WHERE clause, and if the table is MyISAM, the COUNT can be taken from the cardinality summary of the index. All of your existing queries that leverage the current index will be unaffected as well.

edit: also, I'm unsure why you're doing COUNT(viewed_date) instead of COUNT(*). Is viewed_date ever NULL? If not, just use COUNT(*), which, in conjunction with my other suggestion, eliminates the need to go to the .MYD file.
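A sketch of this suggestion, assuming the existing index is named idx_store_feature (both index names here are assumptions; adjust to your schema):

```sql
-- Replace the two-column index with one that also covers IsTrial:
ALTER TABLE theTable
  DROP INDEX idx_store_feature,
  ADD INDEX idx_store_feature_trial (Store_ID, Feature_ID, IsTrial);

-- With COUNT(*) instead of COUNT(viewed_date), the count can then be
-- satisfied from the index alone:
SELECT COUNT(*)
FROM theTable
WHERE Store_ID = 2
  AND Feature_ID = 12
  AND IsTrial = 0;
```

If the Viewed_Date range filter from the question is also needed, appending Viewed_Date as a fourth index column would keep the query covered.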

Mike Sherov
That would be sort of bad practice. Good idea, nevertheless.
Tesnep
why is it bad practice?
Mike Sherov
P.S. See my edit.
Mike Sherov
What I'm referring to is where you mentioned appending IsTrial to the end of the existing Store_ID and Feature_ID index. If your suggestion is to append 0 or 1 to the Store_ID values themselves, keep in mind that those columns have relationships with other tables as well. If not, can you help me understand better what you mean by this? I apologize if I misunderstood you.
Tesnep
He suggested adding it to the existing index, not the columns.
wallenborn
wallenborn is correct. If I was unclear, I specified "existing index".
Mike Sherov
A: 

Well, you could expand your index to consist of Store_ID, Feature_ID and IsTrial. You won't get any better than this, performance-wise.

aefxx
A: 

My first idea would be an index on (feature_id, store_id, istrial), since feature_id seems to be the column with the highest Shannon entropy. But without knowing the statistics on feature_id, I'm not sure. Maybe you should create two indexes, with (store_id, feature_id, istrial) as the other, and let the optimizer sort it out. Using all three columns also has the advantage that the database can answer your query from the index alone, which should improve performance too.
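A minimal sketch of the two-index idea (index names are assumptions):

```sql
CREATE INDEX idx_feature_store_trial ON theTable (Feature_ID, Store_ID, IsTrial);
CREATE INDEX idx_store_feature_trial ON theTable (Store_ID, Feature_ID, IsTrial);

-- Let the optimizer sort it out; EXPLAIN reveals which index it chose:
EXPLAIN
SELECT COUNT(*)
FROM theTable
WHERE Store_ID = 2
  AND Feature_ID = 12
  AND IsTrial = 0;
```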

But if none of your columns is selective enough to sufficiently improve index performance, you might have to resort to denormalization, using INSERT/UPDATE triggers to maintain a second table (feature_id, store_id, istrial, view_count). This would slow down inserts and updates, of course.
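The trigger-based denormalization could be sketched as follows (the table and trigger names are assumptions, and a date column would have to be added to the key to support per-period reporting):

```sql
-- Pre-aggregated counts, maintained by trigger:
CREATE TABLE view_counts (
  Feature_ID INT     NOT NULL,
  Store_ID   INT     NOT NULL,
  IsTrial    TINYINT NOT NULL,
  view_count INT     NOT NULL DEFAULT 0,
  PRIMARY KEY (Feature_ID, Store_ID, IsTrial)
);

CREATE TRIGGER trg_views_insert
AFTER INSERT ON theTable
FOR EACH ROW
  INSERT INTO view_counts (Feature_ID, Store_ID, IsTrial, view_count)
  VALUES (NEW.Feature_ID, NEW.Store_ID, NEW.IsTrial, 1)
  ON DUPLICATE KEY UPDATE view_count = view_count + 1;
```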

wallenborn
The query wouldn't be answered by the index alone if the data you're selecting isn't in the index. Yes, the rows to return will be completely specified by the index, but MySQL would still need to go to the .MYD file to get the actual data, and EXPLAIN wouldn't say "Using index".
Mike Sherov
You're right, I overlooked that COUNT(viewed_date) has to check for non-NULLness and needs to hit the disk for that.
wallenborn
A: 

You might want to think about splitting that table horizontally. You could run a nightly job that puts each store_id in a separate table. Or split on feature_id instead; yes, that's a lot of tables, but if you don't need real-time data, it's the route I would take.
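In MySQL 5.1+ this kind of horizontal split can also be expressed as native table partitioning rather than a nightly job into separate tables. A sketch; note that MySQL requires the partitioning column to be part of every unique key, so the auto-generated primary key would need to include Store_ID for this to work:

```sql
-- One partition per store (Store_ID ranges from 1 to 8):
ALTER TABLE theTable
  PARTITION BY LIST (Store_ID) (
    PARTITION p1 VALUES IN (1),
    PARTITION p2 VALUES IN (2),
    PARTITION p3 VALUES IN (3),
    PARTITION p4 VALUES IN (4),
    PARTITION p5 VALUES IN (5),
    PARTITION p6 VALUES IN (6),
    PARTITION p7 VALUES IN (7),
    PARTITION p8 VALUES IN (8)
  );
```

Queries filtering on Store_ID then only scan the matching partition, while the table remains a single logical object with real-time data.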

Dan Williams
I was thinking more along the lines of creating a separate reports table for historical data, also adding a totals column. The problem is I can only do this for specific Viewed_Date periods, like Dec 2009 to Jan 2010. If the time period of a query changes, it would still have to look into this table. And yes, the client(s) do need real-time data.
Tesnep
A: 

The best way I found to tackle this problem was to skip the DTA's (Database Engine Tuning Advisor's) recommendations and do it on my own, in the following way:

  • Use Profiler to find the costliest queries in terms of CPU usage (probably blocking queries) and apply indexes based on those queries. If the query execution plan can be changed to decrease reads, writes, and overall execution time, do that first. If not (the query is what it is), then apply the clustered/non-clustered index combination that best suits it. This depends on the nature of the existing table indexes, the total bytes of the columns participating in the index, etc.
  • Run queries in SSMS to find the most frequently executed queries and treat them the same way.
  • Create a defragmentation schedule that either reorganizes or rebuilds indexes, depending on how fragmented they are.

I am pretty sure others can suggest good ideas too, but doing these gave me good results. I hope someone finds this helpful. I don't think the DTA really makes things faster in terms of indexing, because you really need to review every index it proposes to create. This is especially true for a database that gets hit a lot.
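Since this answer is SQL Server-oriented (Profiler, SSMS, DTA), the defragmentation step above might be sketched in T-SQL as follows (the thresholds are commonly cited guidelines, not hard rules):

```sql
-- Check fragmentation first via sys.dm_db_index_physical_stats, then:
ALTER INDEX ALL ON dbo.theTable REORGANIZE;  -- moderate fragmentation (~5-30%)
-- or, for heavier fragmentation (above ~30%):
ALTER INDEX ALL ON dbo.theTable REBUILD;
```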

Tesnep