views:

264

answers:

5

Lately, I have been doing some reading on indexes of all types and the main advice is to put the clustered index on the primary key of the table, but what if the primary key actually is not used in a query (via a select or join) and is just put for purely relational purposes, so in this case it is not queried against. Example, say I have a car_parts table and it contains 3 columns, car_part_id, car_part_no, and car_part_title. car_part_id is the unique primary key identity column. In this case car_part_no is unique as well and is most likely car_part_title. car_part_no is what is most queried against, so doesn't it make sense to put the clustered index on that column instead of car_part_id? The basics of the question is what column should actually have the clustered index since you are only allowed one of them?

+1  A: 

The clustered index should go on the column that will be the most queried. This includes joins, as a join must access the table just like a direct query, and find the rows indicated.

You can always rebuild your indexes later on if your application changes and you find you need to optimize a table with a different index structure.

Some additional guidelines for deciding on what to cluster your table on can be found on MSDN here: Clustered Index Design Guidelines.

womp
so I would assume that my post makes sense then. Would using an inner join on the primary key count as a query on the column even if it is not included in the select list.
Xaisoft
...remembering that querying doesn't necessarily mean being uses as a criteria by end-user in their searches, but also being used in JOINs and various [implied/forgotten] subqueries.
mjv
@mjv, you read my mind.
Xaisoft
... but of course, there can be many exceptions to womp's generic (and oft' valid) rule, such as when there exist other indexes on the table that "cover" various query cases. 't is a art _and_ a science ;-)
mjv
Yes, it does count. It might be useful for you to check out Clustered Index Design Guidelines on MSDN: http://msdn.microsoft.com/en-us/library/ms190639.aspx
womp
@mjv - indeed, it is an art. That's why I wouldn't stress too much about getting it right the first time if you're just starting out on a project. If you get to the point that you need to improve performance of your DB, it's a good problem to have!
womp
+2  A: 

Keep in mind the usage patterns; If you are almost always querying the DB on the car_part_no, then it would probably be beneficial for it to be clustered on that column.

However, don't forget about joins; If you are most often joining to the table and the join uses the car_part_id field, then you have a good reason to keep the cluster on car_part_id.

Something else to keep in mind (less so in this case, but generally when considering clustered indexes) is that the clustered index will appear implicitly in every other index on the table; So for example, if you were to index car_part_title, that index will also include the car_part_id implicitly. This can affect whether or not an index covers a query and also affects how much disk space the index will take (which affects memory usage, etc).

Chris Shaffer
+2  A: 

Clustered indexes are good when you query ranges of data. For example

SELECT * FROM theTable WHERE age BETWEEN 10 AND 20

The clustered index arranges rows in the particular order on your computer disk. That's why rows with age = 10 will be next to each other, and after them there will be rows with age = 11, etc.

If you have exact select, like this:

SELECT * FROM theTable WHERE age = 20

the non-clustered index is also good. It doesn't rearrange data on your computer disk, but it builds special tree with a pointers to the rows you need.

So it strongly depends on the type of queries you perform.

Lukasz Lysik
+4  A: 

An index, clustered or non clustred, can be used by the query optimizer if and only if the leftmost key in the index is filtered on. So if you define an index on columns (A, B, C), a WHERE condition on B=@b, on C=@c or on B=@b AND C=@c will not fully leverage the index (see note). This applies also to join conditions. Any WHERE filter that includes A will consider the index: A=@a or A=@a AND B=@b or A=@a AND C=@c or A=@a AND B=@b AND C=@c.

So in your example if you make the clustred index on part_no as the leftmost key, then a query looking for a specific part_id will not use the index and a separate non-clustered index must exist on part-id.

Now about the question which of the many indexes should be the clustered one. If you have several query patterns that are about the same importance and frequency and contradict each other on terms of the keys needed (eg. frequent queries by either part_no or part_id) then you take other factors into consideration:

  • width: the clustered index key is used as the lookup key by all other non-clustered indexes. So if you choose a wide key (say two uniquidentifier columns) then you are making all the other indexes wider, thus consuming more space, generating more IO and slowing down everything. So between equaly good keys from a read point of view, choose the narrowest one as clustered and make the wider ones non-clustered.
  • contention: if you have specific patterns of insert and delete try to separate them physically so they occur on different portions of the clustered index. Eg. if the table acts as a queue with all inserts at one logical end and all deletes at the other logical end, try to layout the clustered index so that the physical order matches this logical order (eg. enqueue order).
  • partitioning: if the table is very large and you plan to deploy partioning then the partitioning key must be the clustered index. Typical example is historical data that is archived using a sliding window partitioning scheme. Even thow the entities have a logical primary key like 'entity_id', the clustred index is done by a datetime column that is also used for the partitioning function.
  • stability: a key that changes often is a poor candidate for a clustered key as each update the clustered key value and force all non-clustered indexes to update the lookup key they store. As an update of a clustered key will also likely relocate the record into a different page it can cause fragmentation on the clustered index.

Note: not fully leverage as sometimes the engine will choose an non-clustered index to scan instead of the clustered index simply because is narrower and thus has fewer pages to scan. In my example if you have an index on (A, B, C) and a WHERE filter on B=@b and the query projects C, the index will be likely used but not as a seek, as a scan, because is still faster than a full clustered scan (fewer pages).

Remus Rusanu
+2  A: 

Kimberly Tripp is always one of the best sources on insights on indexing.

See her blog post "Ever-increasing clustering key - the Clustered Index Debate - again!" in which she quite clearly lists and explains the main requirements for a good clustering key - it needs to be:

  • Unique
  • Narrow
  • Static

and best of all, if you can manage:

  • ever-increasing

Taking all this into account, an INT IDENTITY (or BIGINT IDENTITY if you really need more than 2 billion rows) works out to be the best choice in the vast majority of cases.

One thing a lot of people don't realize (and thus don't take into account when making their choice) is the fact that the clustering key (all the columns that make up the clustered index) will be added to each and every index entry for each and every non-clustered index on your table - thus the "narrow" requirement becomes extra important!

Also, since the clustering key is used for bookmark lookups (looking up the actual data row when a row is found in a non-clustered index), the "unique" requirement also becomes very important. So important in fact, that if you choose a (set of) column(s) that is/are not guaranteed to be unique, SQL Server will add a 4-byte uniquefier to each row --> thus making each and every of your clustered index keys extra wide ; definitely NOT a good thing.

Marc

marc_s
Thanks for the link marc.
Xaisoft