I have a table with two indices; one is a multi-column clustered index on 3 columns:

(
   symbolid int16,
   bartime int32,
   typeid int8
)

The second is non clustered on

(
   bartime int16
)

The select statement I'm trying to run is:

    SELECT symbolID, vTrdBuy
    FROM mvTrdHidUhd 
    WHERE typeID = 1 
    AND barDateTime = 44991 
    AND symbolid in (1010,1020,1030,1040,1050,1060)

I ran this query on SQL Server 2008 in the SQL Server Management Studio editor with the actual execution plan enabled. I found that SQL Server uses the second index and proposes creating a new index on the three columns (symbolid, bartime, typeid), but nonclustered! (I think it says nonclustered because there is already a clustered one.)

This choice is wrong: I reran the same query and forced SQL Server to use the clustered index (using "with index"), and performance was better, as it should be.

I have two questions here, one related to this behavior and the second about the query itself:

  1. Why does SQL Server choose the wrong index and then propose the same index I already have?
  2. Which of the following should I use in the "where" condition for better performance?

symbolid in (1010,1020,1030,1040,1050,1060)

(symbolid = 1010 or symbolid = 1020 ..etc)

(symbolid between 1010 and 1060)

After Testing

I found that when I change the where condition from IN to >= and <=, the nonclustered index on the bartime column gives better performance than the clustered index on 3 columns.

So I have two cases: if the WHERE uses IN, it is better to use the clustered index; if it contains >= and <=, the second index is better.
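
A comparison like the one described above could be set up roughly as follows. This is only a sketch: it assumes mvTrdHidUhd is a plain table whose clustered index is the three-column one, and it relies on the table hint INDEX(1), which refers to the clustered index. The logical reads and timings reported in the Messages tab can then be compared for the two variants.

```sql
-- Report I/O and timing figures for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Variant 1: let the optimizer pick the index.
SELECT symbolID, vTrdBuy
FROM mvTrdHidUhd
WHERE typeID = 1
  AND barDateTime = 44991
  AND symbolID IN (1010, 1020, 1030, 1040, 1050, 1060);

-- Variant 2: force the clustered index (index id 1).
SELECT symbolID, vTrdBuy
FROM mvTrdHidUhd WITH (INDEX(1))
WHERE typeID = 1
  AND barDateTime = 44991
  AND symbolID IN (1010, 1020, 1030, 1040, 1050, 1060);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```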

A: 

Updating the statistics on the table / indexes may make it choose the correct index
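
A minimal sketch of what that could look like, assuming the table name from the question; FULLSCAN is optional and may be slow on a large table, so a sampled update may be preferable there:

```sql
-- Rebuild all statistics on the table by reading every row.
UPDATE STATISTICS mvTrdHidUhd WITH FULLSCAN;

-- Check when each statistics object on the table was last updated.
SELECT name AS stats_name,
       STATS_DATE(object_id, stats_id) AS last_updated
FROM sys.stats
WHERE object_id = OBJECT_ID('mvTrdHidUhd');
```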

ck
I will test it, but does this mean I have to force SQL Server to use the correct index rather than rely on its selection?
Ahmed Said
No, if your statistics are correct and your indexes and queries are designed properly, then you shouldn't need to force certain indexes in your query.
ck
A: 

Use symbolid BETWEEN 1010 AND 1060 if possible. The use of BETWEEN, =, >=, >, < or <=, or a combination of these with AND, generally leads to better performance and better index selection than the use of OR or IN.
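
As a sketch, the query from the question rewritten with a range that spans its six values might look like this. Note the assumption: BETWEEN matches every symbolid in the range, so the result is the same as the IN version only if no other symbol ids exist between 1010 and 1060.

```sql
-- Range predicate instead of an IN list.
-- Returns the same rows as the IN version only if the six listed
-- values are the only symbol ids between 1010 and 1060.
SELECT symbolID, vTrdBuy
FROM mvTrdHidUhd
WHERE typeID = 1
  AND barDateTime = 44991
  AND symbolID BETWEEN 1010 AND 1060;
```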

pts
A: 

It is possible that the order of the index columns affects whether the optimiser will choose your index. You indicate the index is (symbolid int16, bartime int32, typeid int8), but symbolid is the least selective predicate in your where clause. This would require 6 index lookups for the 6 values you have.

I would probably start with the between statement but only testing with your data, server, indexes etc will prove the best case.

If you are going to create another index try the 2 other orders for those columns.

And, as noted elsewhere, update your statistics.

Karl
Choose the most restrictive column (fewest rows output) first in your where-clause.
Scoregraphic
A: 

You can also try out a covering index on (symbolid, bartime, typeid, vTrdBuy)

AlexKuznetsov
okay but this may decrease the performance
Ahmed Said
I mean the insertion performance
Ahmed Said
+1  A: 

Your query references four columns:

  • symbolID
  • vTrdBuy
  • typeID
  • barDateTime

While the clustered index only covers three of them:

  • symbolID
  • typeID
  • barDateTime

The reason SQL Server ignores that index is that it's useless to it. The index is first sorted by symbolID, and you don't want a specific symbolID, but a bunch of random values. This means that it has to read all over the table.

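The next column in the clustered index is barDateTime, but because the index is sorted by symbolID first, that column only helps within each individual symbolID. On its own it can't be used to skip to the rows SQL Server actually wants.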

Looking at the query, two columns are very specific in limiting what rows you want to return:

WHERE typeID = 1
AND barDateTime = 44991

Creating an index that starts with typeID and barDateTime can really be useful in helping SQL Server jump to the rows that you are interested in.

First SQL Server can jump right to the rows that are

typeID = 1.

Once there, it can jump right to bars where

barDateTime = 44991

It can do this by seeking right through the index, since the index is ordered by the columns in it. This is very fast, and it's eliminated the majority of rows from being looked at.

If you were to create the index:

(
   typeID,
   barDateTime,
   symbolID
)

it still might not be useful if the query returns a lot of rows. In order to finish the SELECT statement, SQL Server still needs the vTrdBuy value. It has to get this by jumping back into the table for each one of the rows that matches the criteria (called a Bookmark Lookup). If there are too many rows (say > 500), SQL Server will just forget the index and scan the entire table, because that would be faster.

You want to prevent the bookmark lookup. So that SQL Server does not have to go back to the table for the missing value, include that value in the index:

CREATE INDEX IX_mvTrdHidUhd_FancyCovering ON mvTrdHidUhd 
(
   typeID, barDateTime, symbolID, vTrdBuy
)

Now you have an index that contains everything SQL Server wants, in the order that it wants, and you don't have to mess with the physical sort order (i.e. clustering) of the table.
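
A possible variation, sketched here as an assumption rather than as part of the original suggestion: since vTrdBuy is only returned and never searched on, it could be carried as an included column instead of a key column (INCLUDE is available from SQL Server 2005 onward). The index name below is hypothetical. As with any extra index, it adds some cost to inserts and updates.

```sql
-- Same seek columns as above; vTrdBuy is stored only at the leaf level
-- so the query can be answered from the index without a bookmark lookup.
CREATE NONCLUSTERED INDEX IX_mvTrdHidUhd_TypeBarSymbol_Incl  -- hypothetical name
ON mvTrdHidUhd (typeID, barDateTime, symbolID)
INCLUDE (vTrdBuy);
```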

Ian Boyd
By definition, a clustered index always covers all queries. It is wrong to state that: "While the clustered index only covers three of them * symbolID * vTrdBuy * typeID * barDateTime"
AlexKuznetsov
How do you figure that a clustered index always covers all columns?
Ian Boyd
+1  A: 
SELECT  symbolID, vTrdBuy
FROM    mvTrdHidUhd 
WHERE   typeID = 1 
        AND barDateTime = 44991 
        AND symbolid IN (1010,1020,1030,1040,1050,1060)

This condition is not covered by a single contiguous range of your clustered index.

These rows:

1010, 44991, 1
1010, 50000, 1
1020, 44991, 1

will come in order in the index, but your query will select the first and the third one, skipping the second.

SQL Server can use Clustered Index Seek if there is a limited number of predicates, like in your IN case. In this case it uses a number of ranges:

SELECT  symbolID, vTrdBuy
FROM    mvTrdHidUhd 
WHERE   (typeID = 1 
        AND barDateTime = 44991 
        AND symbolid = 1010)
        OR
        (typeID = 1 
        AND barDateTime = 44991 
        AND symbolid = 1020)
        OR …

But in the case of a BETWEEN range on symbolid it cannot construct such a limited number of predicates, which is why it reverts to a less efficient Clustered Index Scan (which scans on symbolid and just filters the wrong results out).

In this case your nonclustered index performs better.

You could rewrite your query like this:

SELECT  symbolID, vTrdBuy
FROM    (
        SELECT  DISTINCT symbolid
        FROM    mvTrdHidUhd 
        WHERE   symbolid BETWEEN 1010 AND 1060
        ) s
JOIN    mvTrdHidUhd m
ON      m.symbolid = s.symbolid
        AND m.typeID = 1 
        AND m.barDateTime = 44991

This will use Clustered Index Seek on your table as well, both to build a list of DISTINCT symbolid and to join on this list.

Quassnoi
You said what i said, but with the examples of data. +1
Ian Boyd
Thank you for this good explanation, but you did not answer my first question?
Ahmed Said
@Ahmed: it chooses the nonclustered index because it thinks it will be more selective. As for the proposal: could you please check that SQL Server proposes exactly the same column order?
Quassnoi