Usually the clustered index is created in SQL Server Management Studio by setting the primary key. However, my recent question about PK <-> clustered index (http://stackoverflow.com/questions/2262998/meaning-of-primary-key-to-microsoft-sql-server-2008) has shown that the PK and the clustered index do not have to be on the same columns.

So how should we choose clustered indexes then? Let's have the following example:

    create table Customers (ID int, ...)
    create table Orders (ID int, CustomerID int)

We would usually create the PK/CI on both ID columns, but I thought about creating the CI for Orders on CustomerID instead. Is that the best choice?

A: 

Orders.CustomerID won't be unique, so it won't qualify alone for the CI.

Try CI (CustomerID, ID) instead. This way, records will be grouped first by CustomerID (which is good), then by ID (which may not be good if IDs are random). If ID only ever increases (as it would with IDENTITY), inserts will perform better.
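A minimal sketch of that suggestion in T-SQL. It assumes the Orders table from the question and that ID is an IDENTITY column; the constraint and index names are made up for illustration.

```sql
-- Orders table from the question; IDENTITY on ID is an assumption.
CREATE TABLE Orders (
    ID         int IDENTITY(1,1) NOT NULL,
    CustomerID int NOT NULL
);

-- Keep ID as the primary key, but enforce it with a nonclustered index
-- so that the clustered index can be placed elsewhere.
ALTER TABLE Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (ID);

-- Cluster on (CustomerID, ID). The pair is unique because ID alone is unique,
-- so the index can be declared UNIQUE.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_CustomerID_ID
    ON Orders (CustomerID, ID);
```

With this layout, new orders for a given customer are appended at the end of that customer's range of rows rather than at the end of the table.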

Developer Art
A clustered index does not need to be unique. I think you are confusing clustered index with primary key / unique constraint.
Aaron Bertrand
Actually, it does need to be unique. If you don't make it unique, the engine will add some integer value to "uniquify" it behind the scenes. I read some time ago an article where a guy tried to generate so many records with the same column value to "overflow" this integer value to cause the SQL Server engine to report errors. And he succeeded.
Developer Art
+1  A: 

If you're concerned about clustering, it's usually to help improve data retrieval. In your example, you're probably going to want all records for a given customer at once. Clustering on CustomerID will keep those rows together on the same or adjacent pages rather than scattered across many pages in your file.
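As a rough illustration, assuming the Orders table from the question (the index name and the customer value 42 are hypothetical), clustering on CustomerID and then reading one customer's orders could look like this:

```sql
-- Orders table from the question.
CREATE TABLE Orders (
    ID         int NOT NULL,
    CustomerID int NOT NULL
);

-- Non-unique clustered index: rows are stored in CustomerID order.
CREATE CLUSTERED INDEX CIX_Orders_CustomerID
    ON Orders (CustomerID);

-- Report logical reads for the query below in the Messages output.
SET STATISTICS IO ON;

-- All orders of one customer come from one contiguous range of the index.
SELECT ID, CustomerID
FROM   Orders
WHERE  CustomerID = 42;

SET STATISTICS IO OFF;
```

Comparing the logical reads reported here against the same query on a table clustered on ID is one way to check whether the grouping actually pays off for your data.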

Rule of thumb: cluster on what you want to show a collection of. Line items in a purchase order are the classic example.

No Refunds No Returns
Line items on a PO might be a good idea for a cluster, but not if there are only 2 or 3 (or a dozen) line items on the typical order. Unless the number of rows you are clustering together gets into the dozens or hundreds, it's better just to let SQL Server perform the bookmark lookup. I had a system where a business requirement was to find all the "line items" that happened during a particular cashier's shift (to see if they balanced). Denormalizing the "line items" with the `id` of the **Shift**, and then clustering on **Shift**, was a huge speed boost.
Ian Boyd
+2  A: 

The best candidate for a CLUSTERED index is the key you use to refer to your records most often.

Usually, this is the PRIMARY KEY, since it's what is used in searches and/or FOREIGN KEY relationships.

In your case, Orders.ID will most probably participate in the searches and references, so it is the best candidate for being a clustering expression.

If you create the CLUSTERED index on Orders.CustomerID, the following things will happen:

  1. CustomerID is not unique. To ensure uniqueness, a special hidden 32-bit column known as the uniquifier will be added to records that duplicate an existing CustomerID value.

  2. Records in the table will be stored according to this pair of columns (CustomerID, uniquifier).

  3. A secondary index on Orders.ID will be created, with (CustomerID, uniquifier) as the record pointers.

  4. Queries like this:

    SELECT  *
    FROM    Orders
    WHERE   ID = 1234567
    

    will have to do an additional operation, a Clustered Seek, since not all columns are stored in the index on ID. To retrieve all columns, the record must first be located in the clustered table.

This additional operation requires IndexDepth page reads, as many as a simple Clustered Seek, with IndexDepth being O(log(n)) in the total number of records in your table.
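The scenario above can be sketched as follows. The OrderDate column is a hypothetical addition (the question's Orders table has only ID and CustomerID), included so that the index on ID does not already contain every column; the index and constraint names are also made up.

```sql
-- Orders table from the question plus a hypothetical OrderDate column.
CREATE TABLE Orders (
    ID         int      NOT NULL,
    CustomerID int      NOT NULL,
    OrderDate  datetime NOT NULL
);

-- Non-unique clustered index; duplicates get a hidden uniquifier.
CREATE CLUSTERED INDEX CIX_Orders_CustomerID
    ON Orders (CustomerID);

-- Secondary (nonclustered) index on ID. Its leaf rows carry the
-- clustering key (CustomerID plus uniquifier) as the row locator.
ALTER TABLE Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (ID);

-- Seek on PK_Orders, then a second seek into the clustered index
-- to fetch OrderDate, which the index on ID does not store.
SELECT *
FROM   Orders
WHERE  ID = 1234567;
```

In recent versions of SQL Server the second step typically shows up in the execution plan as a Key Lookup against the clustered index.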

Quassnoi
+4  A: 

According to The Queen Of Indexing - Kimberly Tripp - what she looks for in a clustered index is primarily:

  • Unique
  • Narrow
  • Static

And if you can also guarantee:

  • Ever-increasing pattern

then you're pretty close to having your ideal clustering key!

Check out her entire blog post here, and another really interesting one about clustering key impacts on table operations here: The Clustered Index Debate Continues.

Anything like an INT (especially an INT IDENTITY), or possibly an INT and a DATETIME, is an ideal candidate. For other reasons, GUIDs aren't good candidates at all. You might have a GUID as your PK, but don't cluster your table on it: it'll be fragmented beyond recognition and performance will suffer.
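One possible way to combine a GUID primary key with a narrow, ever-increasing clustering key is sketched below. The CustomerGuid column and all constraint and index names are assumptions made for illustration, not something from the question.

```sql
-- Customers table with a GUID primary key that is NOT the clustering key.
CREATE TABLE Customers (
    ID           int IDENTITY(1,1) NOT NULL,            -- unique, narrow, static, ever-increasing
    CustomerGuid uniqueidentifier  NOT NULL
        CONSTRAINT DF_Customers_CustomerGuid DEFAULT NEWID(),
    CONSTRAINT PK_Customers PRIMARY KEY NONCLUSTERED (CustomerGuid)
);

-- Cluster the table on the INT IDENTITY column instead of the GUID.
CREATE UNIQUE CLUSTERED INDEX CIX_Customers_ID
    ON Customers (ID);
```

New rows then land at the end of the clustered index, while the random GUID values only affect the much smaller nonclustered index.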

marc_s