views:

250

answers:

5

I have a series of questions about Keys, Indexes and Constraints in SQL, SQL 2005 in particular. I have been working with SQL for about 4 years but I have never been able to get definitive answers on this topic and there is always contradictory info on blog posts, etc. Most of the time tables I create and use just have an Identity column that is a Primary Key and other tables point to it via a Foreign Key.

With join tables I have no Identity and create a composite Primary Key over the Foreign Key columns. The following is a set of statements of my current beliefs, which may be wrong, please correct me if so, and other questions.

So here goes:

As I understand it the difference between a Clustered and Non Clustered Index (regardless of whether it is Unique or not) is that the Clustered Index affects the physical ordering of data in a table (hence you can only have one in a table), whereas a Non Clustered Index builds a tree data structure. When creating Indexes why should I care about Clustered vs Non Clustered? When should I use one or the other? I was told that inserting and deleting are slow with Non-Clustered indexes as the tree needs to be "rebuilt." I take it Clustered indexes do not affect performance this way?

I see that Primary Keys are actually just Clustered Indexes that are Unique (do they have to be clustered?). What is special about a Primary Key vs a Clustered Unique Index?

I have also seen Constraints, but I have never used them or really looked at them. I was told that the purpose of Constraints is that they are for enforcing data integrity, whereas Indexes are aimed at performance. I have also read that constraints are acually implemented as Indexes anyway so they are "the same." This doesnt sound right to me. How are constraints different to Indexes?

A: 

I don't have time to answer this in depth, so here is some info off the top of my head:

You're right about clustered indexes. They rearrange the physical data according to the sort order of the clustered index. You can use clustered indexes specifically for range-bound queries (e.g. between dates).

PKs are by default clustered, but they don't have to be. That's just a default setting. The PK is supposed to be a UID for the row.

Constraints can be implemented as indexes (for example, unique constraints), but can also be implemented as default values.

Randolph Potter
+1  A: 

Your assumption about the clustered vs non-clustered is pretty good

It also seems that primary key enforces non null uniquenes, while the unique index does not enforce non null primary vs unique

astander
+1  A: 

The primary key is a logical concept in relational database theory - it's a key (and typically also an index) which is designed to uniquely identify any of your rows. Therefore it must be unique and it cannot be NULL.

The clustering key is a storage-physical concept of SQL Server specifically. It's a special index that isn't just used for lookups etc., but also defines the physical structure of your data in your table. In a printed phonebook in Western European culture (except maybe for Iceland ), the clustered index would be "LastName, FirstName".

Since the clustering index defines your physical data layout, you can only ever have one of those (or none - not recommended, though).

Requirements for a clustering key are:

  • must be unique (if not, SQL Server will add a 4-byte "uniqueifier")
  • should be stable (never changing)
  • should be as small as possible (INT is best)
  • should be ever-increasing (think: IDENTITY)

SQL Server makes your primary key the clustering key by default - but you can change that if you need to. Also, mind you: the columns that make up the clustering key will be added to each and every entry of each and every non-clustered index on your table - so you want to keep your clustering key as small as possible. This is because the clustering key will be used to do the "bookmark lookup" - if you found an entry in a non-clustered index (e.g. a person by their social security number) and now you need to grab the entire row of data to get more details, you need to do a lookup, and for this, the clustering key is used.

There's a great debate about what makes a good or useful clustering and/or primary key - here's a few excellent blog posts to read about this:

Marc

marc_s
+2  A: 

Clustered indexes are, as you put it correctly, the definition as to how data in a table is stored physically, i.e. you have a B-tree sorted using the clustering key and you have the data at the leaf level.

Non-clustered indexes on the other hand are separate tree structures which at the leaf level only have the clustering key (or a RID if the table is a heap), meaning that when you use a non-clustered index, you'll have to use the clustered index to get the other columns (unless your request is fully covered by the non-clustered index, which can happen if you request only the columns, which constitute the non-clustered index key columns).

When should you use one or the other ? Well, since you can have only one clustered index, define it on the columns which makes most sense, i.e. when you look up clients by ID most of the time, define a clustered index on the ID. Non-clustered indexes should be defined on columns which are used less often.

Regarding performance, inserts or updates that change the index key are always painfull, regardless of whether it is a clusted on non-clustered index, since page splits can happen, which forces data to be moved between pages (moving the pages of a clustered index hurts more, since you have more data in the leaf level). Thus the general rule is to avoid changing the index key and inserting new values so that they would be sequencial. Otherwise you'll encounter fragmentation and will have to rebuild your index on a regular basis.

Finally, regarding constraints, by definition, they have nothing to do with indexes, yet SQL server has chosen to implement them using indexes. E.g. currently, a unique constraint is implemented as an index, however this can change in a future version (though I doubt that will happen). The type of index (clustered or not) is up to you, just remember that you can have only one clustered index.

If you have more questions of this type, I highly recommend reading this book, which covers these topics in depth.

Saulius
+1  A: 

You have several questions. I'll break some of them out:

When creating Indexes why should I care about Clustered vs Non Clustered?

Sometimes you do care how the rows are organized. It depends on your data and how you will use it. For example, if your primary key is a uniqueidentifier, you may not want it to be CLUSTERED, because GUID values are essentially random. This will cause SQL to insert rows randomly throughout the table, causing page splits which hurt performance. If your primary key value will always increment sequentially (int IDENTITY for example), then you probably want it to be CLUSTERED, so your table will always grow at the end.

A primary key is CLUSTERED by default, and most of the time you don't have to worry about it.

I was told that inserting and deleting are slow with Non-Clustered indexes as the tree needs to be "rebuilt." I take it Clustered indexes do not affect performance this way?

Actually, the opposite can be true. NONCLUSTERED indexes are kept as a separate data structure, but the structure is designed to allow some modification without needing to be "re-built". When the index is initially created, you can specify the FILLFACTOR, which specifies how much free space to leave on each page of the index. This allows the index to tolerate some modification before a page split is necessary. Even when a page split must occur, it only affects the neighboring pages, not the entire index.

The same behavior applies to CLUSTERED indexes, but since CLUSTERED indexes store the actual table data, page splitting operations on the index can be much more expensive because the whole row may need to be moved (versus just the key columns and the ROWID in a NONCLUSTERED index).

The following MSDN page talks about FILLFACTOR and page splits: http://msdn.microsoft.com/en-us/library/aa933139%28SQL.80%29.aspx

What is special about a Primary Key vs a Clustered Unique Index? How are constraints different to Indexes?

For both of these I think it's more about declaring your intentions. When you call something a PRIMARY KEY you are declaring that it is the primary method for identifying a given row. Is a PRIMARY KEY physically different from a CLUSTERED UNIQUE INDEX? I'm not sure. The behavior is essentially the same, but your intentions may not be clear to someone working with your database.

Regarding constraints, there are many types of constraints. For a UNIQUE CONSTRAINT, there isn't really a difference between that and a UNIQUE INDEX, other than declaring your intention. There are other types of constraints that do not map directly to a type of index, such as CHECK constraints, DEFAULT constraints, and FOREIGN KEY constraints.

Brannon