views:

162

answers:

3

I've heard that you should put columns that will be the most selective at the beginning of the index declaration. Example:

CREATE NONCLUSTERED INDEX MyINDX on Table1
(
   MostSelective,
   SecondMost,
   Least
)

First off, is what I'm saying correct? If so, am i likely to see large differences in performance by rearranging the order of the columns in my index or is it more of a "nice to do" practice?

The reason I'm asking is because after putting a query through the DTA it recommended that I create an index that had almost all of the same columns in it as an existing index, just in a different order. I was considering just adding the missing columns to the existing index and calling it good. Throughts?

+5  A: 

you should put columns that will be the most selective at the beginning of the index declaration.

Correct. Indexes can be composites - composed of multiple columns - and the order is important because of the leftmost principle. Reason is, that the database checks the list from left to right, and has to find a corresponding column reference matching the order defined. For example, having an index on an address table with columns:

  • Address
  • City
  • State

Any query using the address column can utilize the index, but if the query only has either city and/or state references - the index can not be used. This is because the leftmost column isn't referenced. Query performance should tell you which is optimal - individual indexes, or multiple composites with different orders. Good read: The Tipping Point, by Kimberley Tripp

OMG Ponies
What if it was only the rightmost column that wasn't being used? So a query used Address and city, but NOT state. Would the index be used then?
Abe Miessler
@Abe: Rightmost would not be used - you have to satisfy the index order starting from the left. Miss one, can't use it.
OMG Ponies
+7  A: 

Look at an index like this:

Cols
  1   2   3
-------------
|   | 1 |   |
| A |---|   |
|   | 2 |   |
|---|---|   |
|   |   |   |
|   | 1 | 9 |
| B |   |   |
|   |---|   |
|   | 2 |   |
|   |---|   |
|   | 3 |   |
|---|---|   |

See how restricting on A first, as your first column eliminates more results than restricting on your second column first? It's easier if you picture how the index must be traversed across, column 1, then column 2, etc...you see that lopping off most of the results in the fist pass makes the 2nd step that much faster.

Another case, if you queried on column 3, the optimizer wouldn't even use the index, because it's not helpful at all in narrowing down the result sets. Anytime you're in a query, narrowing down the number of results to deal with before the next step means better performance.

Since the index is also stored this way, there's no backtracking across the index to find the first column when you're querying on it.

In short: No, it's not for show, there are real performance benefits.

Nick Craver
Good info, thanks nick!
Abe Miessler
+3  A: 

The order of columns is critical. Now which order is correct it depends on how you are going to query it. An index can be used to do an exact seek or an range scan. An exact seek is when values for all columns in the index are specified and the query lands exactly on the row is interested in. For seeks the order of columns is irelevant. A range scan is when only some columns are specified, and in this case whenthe order becomes important. SQL Server can use a an index for a range scan only if the leftmost column is specified, and then only if the next leftmost column is specified, and so on. If you have an index on (A,B,C) it can be used to range scan for A=@a, for A=@a AND B=@b bit not for B=@b, for C=@c norB=@b AND C=@c. The case A=@a AND C=@c is mixed one, as in the A=@a portion will use the index, but the C=@c not (the query will scan all B values for A=@a, will no 'skip' to C=@c). Other database systems have the so called 'skip scan' operator that can take some advantage of iner columns in an index when the outer columns is not specified.

With that knowledge in hand you can look at the index definitions again. An index on (MostSelective, SecondMost, Least) will be effective only when MostSelective column is specified. But that being the most selective, the relevance of the inner columns will quikly degrade. Very often you'll find that a better index is on (MostSelective) include (SecondMost, Least) or on (MostSelective, SecondMost) include (Least). Because the inner columns are less relevant, placing low selectivity columns in such right positions in the index make them nothing but noise for a seek, so it makes sense to move them out of the intermediate pages and keep them only on the leaf pages, for query coverability purposes. In other words, move them to INCLUDE. This becomes more important as the size of Least column increases. The idea is that this index can only benefit queries that specify MostSelective either as an exact value or a range, and that column being the most selective it already restricts the candidate rows to great extent.

On the other hand an index on (Least, SecondMost, MostSelective) may seem a mistake, but it actually quite a powerful index. Because it has the Least column as its outermost query, it can be used for queries that have to aggregate results on low selectivity columns. Such queries are prevalent in OLAP and analysis data warehouses, and this is exactly where such indexes have a very good case going for them. Such indexes actually make excelent clustered indexes, exactly because they organize the physical layout on large chunks of related rows (same Least value, which usually indicate some sort of category or type) and they facilitate analisys queries.

So, unfortunately, there is no 'correct' order. You shouldn't follow any cookie cutter recipe but instead analyze the query pattern you are going to use against those tables and decide which index column order is right.

Remus Rusanu
Awesome response as usual Remus. I'm going to read over your third paragraph a few more times and follow up. I suspect that may be exactly what I need to do.
Abe Miessler