views:

337

answers:

5

Assume I have a table called "table" and I have 3 columns, a, b, and c.

What does it mean to have a non-clustered index on columns a,b?

Is a nonclustered index on columns a,b the same as a nonclustered index on columns b,a? (Note the order).

Also, Is a nonclustered index on column a the same as a nonclustered index on a,c?

I was looking at the website sqlserver performance and they had these dmv scripts where it would tell you if you had overlapping indexes and I believe it was saying that having an index on a is the same as a,b, so it is redundant. Is this true about indexes?

One last question is why is the clustered index put on the primary key. Most of the time the primary key is not queried against, so shouldn't the clustered index be on the most queried column. I am probably missing something here like having it on the primary key speeds up joins?

Great explanations. Should I turn this into a wiki and change the title index explanation?

+2  A: 

A non-clustered index on (a, b) is a "copy" of a part of the table whose rows are sorted on a then on b and contain the reference to the original row.

It helps to run the queries like this:

SELECT  *
FROM    mytable
WHERE   a = @A
        AND b = @B

, this:

SELECT  *
FROM    mytable
ORDER BY
        a, b

, this:

SELECT  *
FROM    mytable
WHERE   a = @A
ORDER BY
        b

and many others.

For instance, we have a table like this:

#       col1    col2    col3
1       1       1       1
2       1       4       8
3       7       2       3
4       3       3       9
5       8       9       4
6       2       2       7
7       5       3       5
8       3       9       4

If we create an index on (col2, col3), it will contain the following data:

col2    col3    #
1       1       1
2       3       3
2       7       6
3       5       7
3       9       4
4       8       2
9       4       5
9       4       8

, i. e. sorted first on col2, then on col3, then on the reference to the row.

It's easy to see that this index is an index on col2 just as well (sorting on (col2, col3) implies sorting on col2 alone).

Order matters, so if we create an index on (col3, col2), the rows will be sorted differently:

col2    col3    #
1       1       1
2       3       3
9       4       5
9       4       8
3       5       7
2       7       6
4       8       2
3       9       4

This index is an index on col3 too.

If we want to find the rows within a certain range of (col2, col3) we just take a slice from the ordered data:

SELECT  col2, col3
FROM    mytable
WHERE   col2 BETWEEN 2 AND 3

col2    col3    #
1       1       1
----
2       3       3
2       7       6
3       5       7
3       9       4
----
4       8       2
9       4       5
9       4       8

Easy to see that we cannot take this slice on col3 using this index, since col3 is not ordered by itself.

The "reference" mentioned above is a RID of the row (a pointer to the place in the tablespace), if the table is non-clustered itself, or the value of the table's cluster key if the table is clustered.

A clustered index does not create a shadow copy of the values. Instead, it rearranges the tables rows themselves.

If you create a clustered index on (col2, col3) above, it will just rearrange the table rows:

#       col1    col2    col3
1       1       1       1
3       7       2       3
6       2       2       7
7       5       3       5
4       3       3       9
2       1       4       8
5       8       9       4
8       3       9       4

Clustered or non-clustered, therefore, is a storage method rather than an index.

In Oracle, this is called index-organized table (rows are sorted), as opposed to a heap-organized table (rows are not sorted).

Quassnoi
+2  A: 

index A,B is different from index B,A

That is because an index is organized along a particular sort order. So imagine you need to search with the following WHERE clause

WHERE A='somecrit' AND B='SomepartialCrit%'  -- notice the wildcard

The A,B index will be very efficient at resolving the query, but if it were

WHERE   A='SomepartialCrit%'  AND B='somecrit'

The (A,B) index would only partially help (could be better than full table scan but not optimal..) whereby the (B,A) index would come to the rescue...

For use with a query that included both A and B as exact match (no wildcard), either index could be used in an equivalent fashion (efficiency-wise), although the choice of one particular index could be driven by other part of the query such as ORDER BY clauses etc.

An index on A is different from an index on A,C For one the index on A,C could be used to resolve queries that involve both A and C criteria, and also the A,C index could be use to "cover" the SELECT clause or part thereof, that is: If the SELECT clause only includes column A and C (from this particular table), SQL could provide the results without having to get data from the table proper, it would get the A and C values from the index alone.

Are "redundant" indexes a bad thing ?

As said above, extra indexes may help resolve SELECT queries more efficiently. On the down side, they a) use storage space and b) make the INSERT, UPDATE and DELETE queries less efficient (because the new/updated/deleted values need to be added/changed/deleted in more places.

It is therefore a matter of finding the right balance based on the available storage space available and the use case (some mostly read-only databases can have a slew of indexes without hurting performance at all, databases with frequent inserts can see their performance degrade terribly with too many indices)

On clustered indexes

See explanation by Joel Coehoorn.
No, the clustered index of a given table does not need to be based on the primary key. Selecting a good clustered index (or indeed deciding to not use a clustered index) is a part science part art process which scope is beyond this short response.

mjv
+9  A: 

This is turning into a more-general introduction to indexing, but I suspect you'll still find it useful. The first two paragraphs especially speak to your question.

Clustered vs Non-clustered

This refers to how the table is physically arranged on disk. A clustered index works by sorting the physical pages and rows in a table on disk based on the index definition. Non-clustered indexes actually store a copy of the columns in the index (and only those columns), plus a pointer to the actual records, in a separate location. For this reason, clustered indexes are often faster because they will always cover any data you need in the query. However, you only get one of them because otherwise you'd duplicate the entire table. It's also important to know that adding non-clustered indexes to a table actually slows down write operations like inserts and updates, because the database has to rebuild the index.

Index Order

An index on (a,b) is not the same as on (b,a). If the first case, records in the index are ordered by column 'a' first, and column 'b' only effects the index order when you have duplicate values for 'a'. Searching the index with a column 'b' value only won't help you. In the second case, the reverse happens: records are ordered by column 'b' first, and column 'a' only helps when you have duplicate values for 'b'. Searching that index with a column 'a' value only won't help you.

Covering Indexes

Sometimes a database can fulfill the requirements of a query entirely from an index. In this case, the index is said to be a "covering" index for that query. This is advantageous because indexes are often cached in memory, and so the database may not have to go do disk at all. To understand this, imagine an index on (a,b) where there are very few duplicate values for 'a'. Including 'b' in the index seems wasteful, unless you have a query that runs often that looks for a particular value of 'a' and also needs 'b'. This index will now save a lot work going back to the original table to retrieve 'b'.

Selectivity

Selectivity is a value from 0 to 1 (often expressed as a percentage) that tell you how unique each value in an index is. A selectivity of 1 or 100% means there are no duplicates. A selectivity of 0 means there is only one value in the column. Generally, a higher selectivity (approaching 1) is better for indexes.

To demonstrate this, think about what would happen with a low-selectivity index. For example, you try to speed up a query by adding an index to a bit column in a table with 10000 records. In this case (assuming uniform distribution), the selectivity is .5. You run your query, and the index returns 5000 records. But each of those records still has to go back to the original table, and since the index order doesn't match the table order it would have to do a lot of separate lookups into the table. Instead, it's likely faster to just scan through the entire table start to finish to retrieve the needed data.

Selectivity explains why you would want to cluster on the primary key. Since the clustered index tells the database how to order the table, going for anything less than 100% selectivity here means a query will have to scan the table more often. Clustering on the primary key gives you perfect selectivity. And since this primary key is often used as the record pointer in other indexes, you want to keep it as small as possible (ie, an integer identity column).

There's a good article on the selectivity and indexing here:
http://www.akadia.com/services/ora_index_selectivity.html

Sargable

This refers to whether the database is able to use a particular filter with an index.

As we have shown, indexes normally work by first sorting the data into a specific order, so that lookups into that index can use something efficient like a tree-based search rather than a slower linear search. So anything that can't be effectively compared with sorted data can't be used with an index. A good example is the LIKE operator. This is sargable:

SELECT * FROM [Table] WHERE [Column] LIKE @Value + '%'

but this is not sargable:

SELECT * FROM [Table] WHERE [Column] LIKE '%' + @Value + '%'

Some other things that can make a filter un-sargable are non-deterministic functions (and there are more of those than you think).

Joel Coehoorn
Great explanation. @Xaisoft - make sure the indexes that exist match how you query the data. Look to table joins or where statements and compare them to those indexes.
Michael La Voie
If anyone can think of anything I'm missing that ought to go in a 2-page indexing introduction, let me know. I'm thinking about saving a link to this post or re-writing it on my blog for future use.
Joel Coehoorn
Great stuff! It's good to always serve as inspiration for someones blog :)
Xaisoft
Joel, some examples to go with each section would be helpful.
Xaisoft
+2  A: 

Is a nonclustered index on columns a,b the same as a nonclustered index on columns b,a? (Note the order).

NO! Order is important. If you have a non-clustered index on (a,b), you can use this if your WHERE clause has a restriction on a and b - or if it has just a restriction on a (but not if it has just a check against b).

Also, Is a nonclustered index on column a the same as a nonclustered index on a,c?

No it's not - but the SQL Server query optimizer will use this non-clustered index if it encounters a query with a WHERE clause just on "a".

Marc

marc_s
+2  A: 

Think of the index as a phonebook. Usually phonebooks are ordered by lastname, firstname, street. So if you want to find the phone number of Joe Smith, 101 Main Street, you open the phone book at S for Smith, then you look up all Joes under Smith, look for the Joe Smith who lives at 101 Main Street, and you find the phone number.

The phone book could be ordered differently, say by street, firstname, lastname. Then you'd look up Main Street first, then Joe, and finally Smith. If you only want to find one person's number, that would be equally fast.

The difference becomes important if you want to list the phone numbers of all people who live in on Main Street and whose first name is Joe. With a normal phone book this is a drudge: you have to loop over all last names, find out the Joes with that last name, and whether they live on Main Street. To do that, you have to look through the entire phone book. But if the index order is street, firstname, lastname, the task is almost trivial: Look up Main Street, Joe, and copy all lastnames and their phone numbers. Way faster.

Also, the fact that phone books list streets, too, is irrelevant if you are only interested in the names. If you want to find the phone numbers of all Joe Smiths, you'd need a phone book ordered by lastname, firstname (or firstname, lastname). You don't care if the phone book has all Joe Smiths ordered by street or not. In that sense, an index on (lastname, firstname, street) encompasses an index on (lastname, firstname).

So: index (a,b,c) is not equal to (c,a,b), and if you have (a,c) you don't need another (a)

wallenborn