views:

222

answers:

5

I need help with indexing in MySQL. I have a table with the following columns:

ID Store_ID Feature_ID Order_ID Viewed_Date Deal_ID IsTrial
ID is auto-generated. Store_ID ranges from 1 to 8, and Feature_ID from 1 to roughly 100. Viewed_Date is the date and time at which the row is inserted. IsTrial is either 0 or 1.
You can ignore Order_ID and Deal_ID from this discussion.

There are millions of rows in the table, and we have a reporting backend that needs the number of views, either within a certain period or overall, where IsTrial is 0 for a particular store and a particular feature.

The query takes the form of:

select count(viewed_date)
from theTable
where viewed_date between '2009-12-01' and '2010-12-31'
and store_id = '2'
and feature_id = '12'
and istrial = 0

In SQL Server you can have a filtered index to use for IsTrial. Is there anything similar to this in MySQL? Also, Store_ID and Feature_ID contain a lot of duplicate values. I created a composite index on Store_ID and Feature_ID. Although this seems to have decreased the search time, I need a bigger improvement than this. Right now I have more than 4 million rows; to answer a query like the one above, MySQL examines 3.5 million rows to produce a count of 500k.

PS. I forgot to add the viewed_date filter in the query. I have now added it.
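For reference, the setup described above might look like this (the index name is an assumption):

```sql
-- Existing two-column index (name assumed):
CREATE INDEX idx_store_feature ON theTable (Store_ID, Feature_ID);

-- EXPLAIN shows how many rows MySQL expects to examine for the report query:
EXPLAIN
SELECT COUNT(Viewed_Date)
FROM theTable
WHERE Viewed_Date BETWEEN '2009-12-01' AND '2010-12-31'
  AND Store_ID = 2
  AND Feature_ID = 12
  AND IsTrial = 0;
```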

A: 

If you need to optimize this query specifically in MySQL, why not add istrial to the end of the existing index on Store_ID and Feature_ID? This will completely cover the WHERE clause, and if the table is MyISAM, the COUNT can be taken from the cardinality summary of the index. All of your existing queries that leverage the current index will be unaffected as well.

edit: also, I'm unsure why you're doing COUNT(viewed_date) instead of COUNT(*). Is viewed_date ever NULL? If not, just use COUNT(*), which, in conjunction with my other suggestion, eliminates the need to go to the .MYD file.
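A sketch of this suggestion, assuming the existing index is named idx_store_feature (both index names here are assumptions; adjust to your schema):

```sql
-- Replace the two-column index with one that also covers IsTrial:
ALTER TABLE theTable
  DROP INDEX idx_store_feature,
  ADD INDEX idx_store_feature_trial (Store_ID, Feature_ID, IsTrial);

-- With COUNT(*) instead of COUNT(viewed_date), the count can then be
-- satisfied from the index alone:
SELECT COUNT(*)
FROM theTable
WHERE Store_ID = 2
  AND Feature_ID = 12
  AND IsTrial = 0;
```

If the Viewed_Date range filter from the question is also needed, appending Viewed_Date as a fourth index column would keep the query covered.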

Mike Sherov
That would be sort of bad practice. Good idea, nevertheless.
Tesnep
why is it bad practice?
Mike Sherov
P.S. See my edit.
Mike Sherov
What I'm referring to is where you mentioned appending IsTrial to the end of the existing Store_ID and Feature_ID index. If your suggestion is to append 0 or 1 to the Store_ID values themselves, keep in mind that those columns have relationships with other tables as well. If not, can you help me understand better what you mean by this? I apologize if I misunderstood you.
Tesnep
He suggested adding it to the existing index, not the columns.
wallenborn
wallenborn is correct. If I was unclear, I specified "existing index".
Mike Sherov
A: 

Well, you could expand your index to consist of Store_ID, Feature_ID and IsTrial. You won't get any better than this, performance-wise.

aefxx
A: 

My first idea would be an index on (feature_id, store_id, istrial), since feature_id seems to be the column with the highest Shannon entropy. But without knowing the statistics on feature_id, I'm not sure. Maybe you should create two indexes, with (store_id, feature_id, istrial) as the other, and let the optimizer sort it out. Using all three columns also has the advantage that the database can answer your query from the index alone, which should improve performance too.
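A minimal sketch of the two-index idea (index names are assumptions):

```sql
CREATE INDEX idx_feature_store_trial ON theTable (Feature_ID, Store_ID, IsTrial);
CREATE INDEX idx_store_feature_trial ON theTable (Store_ID, Feature_ID, IsTrial);

-- Let the optimizer sort it out; EXPLAIN reveals which index it chose:
EXPLAIN
SELECT COUNT(*)
FROM theTable
WHERE Store_ID = 2
  AND Feature_ID = 12
  AND IsTrial = 0;
```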

But if none of your columns is selective enough to sufficiently improve index performance, you might have to resort to denormalization, using INSERT/UPDATE triggers to maintain a second table (feature_id, store_id, istrial, view_count). This would slow down inserts and updates, of course.
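The trigger-based denormalization could be sketched as follows (the table and trigger names are assumptions, and a date column would have to be added to the key to support per-period reporting):

```sql
-- Pre-aggregated counts, maintained by trigger:
CREATE TABLE view_counts (
  Feature_ID INT     NOT NULL,
  Store_ID   INT     NOT NULL,
  IsTrial    TINYINT NOT NULL,
  view_count INT     NOT NULL DEFAULT 0,
  PRIMARY KEY (Feature_ID, Store_ID, IsTrial)
);

CREATE TRIGGER trg_views_insert
AFTER INSERT ON theTable
FOR EACH ROW
  INSERT INTO view_counts (Feature_ID, Store_ID, IsTrial, view_count)
  VALUES (NEW.Feature_ID, NEW.Store_ID, NEW.IsTrial, 1)
  ON DUPLICATE KEY UPDATE view_count = view_count + 1;
```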

wallenborn
The query wouldn't be answered by the index alone if the data you're selecting isn't in the index. Yes, the rows to return will be completely specified by the index, but MySQL would still need to go to the .MYD file to get the actual data, and EXPLAIN wouldn't say "Using index".
Mike Sherov
You're right, I overlooked that COUNT(viewed_date) has to check for non-NULLness and needs to hit the disk for that.
wallenborn
A: 

You might want to think about splitting that table horizontally. You could run a nightly job that puts each store_id in a separate table. Or split on feature_id instead; yes, that's a lot of tables, but if you don't need real-time data, it's the route I would take.
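In MySQL 5.1+ this kind of horizontal split can also be expressed as native table partitioning rather than a nightly job into separate tables. A sketch; note that MySQL requires the partitioning column to be part of every unique key, so the auto-generated primary key would need to include Store_ID for this to work:

```sql
-- One partition per store (Store_ID ranges from 1 to 8):
ALTER TABLE theTable
  PARTITION BY LIST (Store_ID) (
    PARTITION p1 VALUES IN (1),
    PARTITION p2 VALUES IN (2),
    PARTITION p3 VALUES IN (3),
    PARTITION p4 VALUES IN (4),
    PARTITION p5 VALUES IN (5),
    PARTITION p6 VALUES IN (6),
    PARTITION p7 VALUES IN (7),
    PARTITION p8 VALUES IN (8)
  );
```

Queries filtering on Store_ID then only scan the matching partition, while the table remains a single logical object with real-time data.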

Dan Williams
I was thinking more along the lines of creating a separate reports table for historical data, also adding a totals column. The problem is I can only do this for specific Viewed_Date periods, like Dec 2009 to Jan 2010. If the time period of a query changes, it would still have to look into this table. And yes, the client(s) do need real-time data.
Tesnep
A: 

The best way I found to tackle this problem was to skip the DTA's (Database Engine Tuning Advisor's) recommendations and do it on my own, in the following way:

  • Use Profiler to find the costliest queries in terms of CPU usage (probably blocking queries) and apply indexes based on those queries. If the query execution plan can be changed to decrease reads, writes, and overall execution time, do that first. If not (the query is what it is), then apply the clustered/non-clustered index combination that best suits it. This depends on the nature of the existing table indexes, the total bytes of the columns participating in the index, etc.
  • Run queries in SSMS to find the most frequently executed queries and treat them the same way.
  • Create a defragmentation schedule that either reorganizes or rebuilds indexes, depending on how fragmented they are.

I am pretty sure others can suggest good ideas too, but doing these gave me good results. I hope someone finds this helpful. I don't think the DTA really makes things faster in terms of indexing, because you really need to review every index it proposes to create. This is especially true for a database that gets hit a lot.
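Since this answer is SQL Server-oriented (Profiler, SSMS, DTA), the defragmentation step above might be sketched in T-SQL as follows (the thresholds are commonly cited guidelines, not hard rules):

```sql
-- Check fragmentation first via sys.dm_db_index_physical_stats, then:
ALTER INDEX ALL ON dbo.theTable REORGANIZE;  -- moderate fragmentation (~5-30%)
-- or, for heavier fragmentation (above ~30%):
ALTER INDEX ALL ON dbo.theTable REBUILD;
```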

Tesnep