If I had the following query:

select some cols 
   from tbl_a
INNER JOIN tbl_b ON tbl_a.orderNumber = tbl_b.orderNumber
   where tbl_b.status = 'XX'

Assuming both tables have clustered indexes on order number only, would it be better from a performance perspective to extend the clustered index on table b to include the status column referenced in the where clause?

+1  A: 

Yes, I believe it would be better. One way to tell for sure is to extend the primary key as you describe and look at the query plan for this query. If you don't see a scan being done, you know the extra column in the primary key is being used.
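A rough way to run that comparison, assuming SQL Server and the table names from the question. The selected columns are placeholders for the unspecified "some cols". Run it once before and once after changing the index, then compare the logical reads reported for tbl_b (in Management Studio you can also turn on "Include Actual Execution Plan" to see whether tbl_b is scanned or seeked):

```sql
-- Sketch only: tbl_a, tbl_b, orderNumber and status come from the question;
-- the SELECT list is a stand-in for the real "some cols".
SET STATISTICS IO ON;    -- report logical reads per table
SET STATISTICS TIME ON;  -- report CPU and elapsed time

SELECT tbl_a.orderNumber, tbl_b.status
FROM tbl_a
INNER JOIN tbl_b ON tbl_a.orderNumber = tbl_b.orderNumber
WHERE tbl_b.status = 'XX';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```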

Randy Minder
+2  A: 

Yes, quite possibly. This is called a covering index. The entire query can be served from the index, without accessing tbl_b at all.

However, you should consider the impact on performance of other queries, particularly ones that update the status column.

David M
A covering index would include "some cols" as well
Andomar
In addition, a clustered index is by definition a covering index :)
Andomar
@Andomar (2nd comment) - no, I don't think so. A "covering" index is only covering in the context of a particular query, as the index "covers" all the columns from that table used in the query.
David M
@David M: A clustered index contains all fields. There is no query a clustered index can't cover
Andomar
@Andomar - of course. Hadn't thought that one through. +1 for the comment!
David M
+1  A: 

Adding a non-sequential field like status to a clustered index will slow down writes. You'll need to decide whether the performance gain on reads is worth the performance hit on writes.

You also have the option of creating a second index on (orderNumber, status). You would probably benefit even more from an index on (status, orderNumber).

Sam
+2  A: 

Adding the status to the clustered index would allow SQL Server to resolve the where clause more efficiently. SQL Server could first look up all orders in a particular status from the index, and perform the join based on that. For that to work, the status would have to be the first column in the index:

(status, orderNumber)

Note that if you extend the primary key in this way, the orderNumber column is no longer guaranteed to be unique. So it's better to add this as a separate index.

How useful a separate index is depends on the selectivity of the status. If you're searching for 'Failed' and only 1% of your orders have that status, the index will be very helpful. If the status is not very selective, SQL Server might not even use the new index at all.
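To get a feel for how selective each status value actually is, something like the following should do. It only assumes the tbl_b and status names from the question; the row_count and pct_of_rows aliases are made up for the example:

```sql
-- Sketch: distribution of status values in tbl_b.
-- A value that covers a small percentage of rows is a good candidate
-- for an index seek; a value covering most of the table is not.
SELECT status,
       COUNT(*) AS row_count,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_rows
FROM tbl_b
GROUP BY status
ORDER BY row_count DESC;
```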

Andomar
+1 just a quick note that even with very low selectivity, status in the leftmost position would still be used. Even if there are just two possible values for status (say 0, 1), an index (status, orderNumber) on tbl_b would still cut the candidate order numbers in half, so the plan will very likely choose it. I'm intentionally leaving out the impact of 'some cols' (i.e. the coverability of the projection list) because I think that is a different topic.
Remus Rusanu
+1  A: 

The MS documentation recommends:

...creating a clustered index with as few columns as possible. If a large clustered index key is defined, any nonclustered indexes that are defined on the same table will be significantly larger because the nonclustered index entries contain the clustering key.

Based on that, I would not add the status column to the clustered index. Instead, I would create a separate non-clustered index, which may be a covering index if there are other columns to consider.
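As a sketch of what such an index might look like: the index name and the col1/col2 columns below are hypothetical placeholders for whatever tbl_b columns the query really selects. The clustering key (orderNumber) is carried in every non-clustered index anyway, so listing it in the key here mainly makes the intent explicit:

```sql
-- Sketch only: IX_tbl_b_status_orderNumber, col1 and col2 are invented names.
-- INCLUDE stores the extra columns at the leaf level so the query can be
-- answered from this index without touching the clustered index.
CREATE NONCLUSTERED INDEX IX_tbl_b_status_orderNumber
    ON tbl_b (status, orderNumber)
    INCLUDE (col1, col2);
```

The trade-off is the usual one: every included column makes the index larger and adds work to writes that touch those columns.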

OMG Ponies
+2  A: 

I would not alter the primary key of the table to include a secondary column...it would be better to just add a new non-clustered index on the status field.

The reason is that a clustered index represents the physical order of the data on the disk. If you make it a compound key, rows will (in some/most cases) need to be moved on disk when an order is added or its status is updated. This is very expensive due to the IO and increased lock times.

jkody21
+4  A: 
  1. You extend tbl_b to add status after the orderNumber: create clustered index ... on tbl_b(orderNumber, status). For the query above there will be no noticeable difference. The plan will still have to scan tbl_b end to end and match every order number in tbl_a (probably a merge join).

  2. You extend tbl_b to add status before the orderNumber: create clustered index ... on tbl_b (status, orderNumber). Now there is a HUGE difference. The plan can do a range scan on tbl_b to get only those with status 'XX' and only match tbl_a for the corresponding orderNumber, using a nested loop join.

Placing a low selectivity column (like 'status' usually is) as the leftmost key in an index is usually a good thing. And making a column like 'status' the leftmost column in a clustered index is also usually a good thing, because it groups records with the same status together physically. Note that doing so will have an impact on all queries. You also lose direct access by orderNumber when status is not specified, so you'll have to add a non-clustered index on orderNumber alone to cover that (which is usually the PK non-clustered index).
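Roughly, option 2 plus the non-clustered PK could look like this. This is a sketch under several assumptions: the existing clustered index/PK on tbl_b has already been dropped, orderNumber is unique and NOT NULL in tbl_b, and both object names are invented for the example:

```sql
-- Sketch only: CIX_tbl_b_status_orderNumber and PK_tbl_b are hypothetical names.
-- Assumes tbl_b currently has no clustered index and no primary key.

-- Cluster by status first, so rows with the same status sit together.
CREATE CLUSTERED INDEX CIX_tbl_b_status_orderNumber
    ON tbl_b (status, orderNumber);

-- Keep uniqueness and direct lookups by orderNumber alone.
ALTER TABLE tbl_b
    ADD CONSTRAINT PK_tbl_b PRIMARY KEY NONCLUSTERED (orderNumber);
```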

I made all these comments w/o knowing your actual data cardinality and selectivity. If the cardinality of tbl_a and tbl_b is very skewed then things may be different. E.g. if tbl_a has 10 records with 10 distinct order numbers and tbl_b has 10M records with 10M order numbers, then option 2 would make little difference, since the plan will always choose a scan of tbl_a and 10 seek lookups into tbl_b.

Remus Rusanu
Thanks for the answer. If I make the clustered index orderNumber only, can I add a non-clustered index with status only rather than status, orderNumber (as the clustered index is incorporated in the non-clustered index)?
SuperCoolMoss
A non-clustered index on (status) alone will have little use. You should make the non-clustered index on (status, orderNumber) imho.
Remus Rusanu