views:

1234

answers:

5

I have a question about SQL Server indexes. I'm not a DBA and assume the answer is clear for those of you that are. I am using SQL Server 2008.

I have a table that is similar to the following (but has more columns):

CREATE TABLE [dbo].[Results](
    [ResultID] [int] IDENTITY(1,1) NOT NULL,
    [TypeID] [int] NOT NULL,
    [ItemID] [int] NOT NULL,
    [QueryTime] [datetime] NOT NULL,
    [ResultTypeID] [int] NOT NULL,
    [QueryDay]  AS (datepart(day,[querytime])) PERSISTED,
    [QueryMonth]  AS (datepart(month,[querytime])) PERSISTED,
    [QueryYear]  AS (datepart(year,[querytime])) PERSISTED,
 CONSTRAINT [PK_Results] PRIMARY KEY CLUSTERED 
(
    [ResultID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]

The important fields to notice here are ResultID, the primary key, and QueryTime the datetime at which the result was produced.

I also have the following index (amongst others):

CREATE NONCLUSTERED INDEX [IDX_ResultDate] ON [dbo].[Results] 
(
    [QueryTime] ASC
)
INCLUDE ( [ResultID],
[ItemID],
[TypeID]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]

In a database where I have about a million rows in the table, the index is used when doing a query such as:

select top 1 * from results where querytime>'2009-05-01' order by ResultID asc

In another instance of the same database, with 50 million rows, SQL Server decides not to use the index as it rather does a Clustered Index Scan which ends up being horribly slow. (and speed depends on the date). Even if I use query hints to make it use IDX_ResultDate, it is still a bit slow and it spends 94% of its time sorting by ResultID. I figured that by creating an index with both ResultID and QueryTime as sorted columns in the index, I could speed up my query.

I therefore created the following:

CREATE NONCLUSTERED INDEX [IDX_ResultDate2] ON [dbo].[Results] 
(
[QueryTime] ASC,    
[ResultID] ASC
)
INCLUDE ( [ItemID],
[TypeID]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
GO

I assumed that it would first use the sort by QueryTime to find the matching results, which would already be sorted by ResultID. However, this is not the case as this index changes nothing in performance over the existing one.

I then tried the following index:

CREATE NONCLUSTERED INDEX [IDX_ResultDate3] ON [dbo].[Results] 
(
    [ResultID] ASC,
    [QueryTime] ASC
)
INCLUDE ( [ItemID],
[TypeID]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 90) ON [PRIMARY]
GO

This one produces the intended result. It appears to return in constant time (a fraction of a second).

However, I am puzzled at why IDX_ResultDate3 works well whereas IDX_ResultDate2 doesn't.

I would assume that a binary search in as sorted list of QueryTime followed by peeking at the first result in it's child list of ResultIDs is the fastest way at getting the result. (Hence my initial sort order).

Side question: Should I create a persisted column with the date portion of QueryTime and index on that instead (I already have three persisted columns as you can see above)?

+2  A: 

You can change clustered index to ([QueryTime], [ResultID]), or change your query from

select top 1 * from results where querytime>'2009-05-01' order by ResultID asc

to

select top 1 <only the columns you actually need> from results where querytime>'2009-05-01' order by ResultID asc

and include all those columns in [IDX_ResultDate2]

AlexKuznetsov
+1 exactly - shoot for a "covering" index that includes all the fields needed to satisfy the query (if that's possible)
marc_s
Yep, already doing that (not posted here) but same kind of performance.
Jason Kealey
+7  A: 

I would assume that a binary search in as sorted list of QueryTime followed by peeking at the first result in it's child list of ResultIDs is the fastest way at getting the result. (Hence my initial sort order).

That would be fast indeed, but your query expresses a different request: You are asking for the Result with the minimal ResultId from all the queries that occurred after '2009-05-01'. To satisfy the request it has to seek at the beginning of the range ('2009-05-01'), start a scan from this position to extract all the ResultId, sort them then return the top 1 (the minimum ResultId). The second index you added [idx_ResultDate2] doesn't help much either. The query has to do pretty much exactly the same seek and scan: the ResultIds are sorted whithin a result date, so to find out the top ResultId from all results that are after '2009-05-01' the query still has to scan the index till end.

On your last index, [IDX_ResultDate3], the query is cheating. What it does it starts a scan on the inde and its looking at the QueryTime value, knowing that in this index scan the first Result that has a QueryTime in the desired range (> '2009-05-01') is the one you want (because the ResultId is guaranteed to be the Top 1). You get the result in a 'fraction of a second' out of pure luck: you have a matching Result in the beginning of the index. The query may well scan the entire index and match the very lat Result. You can insert a new Result with a QueryTime like '2010-01-01' and then seek for it, you'll see that the performance degrades as the query has to scan the entire index till end (still faster than a table scan because of the narrower index size).

My question is: are you absolutely positive that your query has to return TOP 1 in ORDER BY ResultID? Or you just picked the order by arbitrarily? If you can change the ORDER BY request to, say, QueryTime, then any of the index (updated: with QueryTime as leftmost column) will return a simple Seek and Fetch, no scansn and no sorting.

Remus Rusanu
Very good explanation. I understand now. I'll see if I can re-engineer the app to use the QueryTime sort.
Jason Kealey
+2  A: 

You have a ranged filtering condition on one field along with ORDER BY another field.

An index, even a composite index, cannot be used to serve both conditions in this case.

When you create an index on (queryTime, resultId), the index is used for filtering. The engine still needs to order the resultset.

When you create an index on (resultId, queryTime), the index is used for ordering.

Since you need a TOP 1 result and the row that satisifes this result happens to be in the beginning of the index, the latter approach turns out to perform better.

If your filtering condition would be selective (i. e. it would return few rows), and the first result you need would happen to be in the end of the index, the first approarch would be better.

See this article in my blog for some more explanations and hints on which index to create in which conditions:

Quassnoi
Nice blog post.
Jason Kealey
A: 

The first thing I would suggest is to check if the statistics for this table (all the indexes) are up-to-date.

Since you get two different execution plans with different data sets, it seems that SQL Server is making an infamous "judgement call" in picking one execution plan over another.

I agree with Remus's explanation of why you are getting "magical" results with your last index.

His suggestion is also good - do you really want to order by resultID? Or if you can order by queryTime, then you will have MUCH better performance because the execution plan will be able to use the index order as the result set's order (AND it will seek through the index, vs. scanning).

Jeff Meatball Yang
Yes, the statistics are up to date. (and yes, it needs to be sorted... unfortunately!)
Jason Kealey
A: 

I'm not sure I can answer the question but would point out that the clustered index key is already included as part of any other index, so its redundant to include ResultID as part of any of the other indexes you propose.

JNappi