views:

149

answers:

3

I have a table storing millions of rows. It looks something like this:

Table_Docs
ID, Bigint (Identity col)
OutputFileID, int
Sequence, int
…(many other fields)

We find ourselves in a situation where the developer who designed it made the OutputFileID the clustered index. It is not unique. There can be thousands of records with this ID. It has no benefit to any processes using this table, so we plan to remove it.

The question, is what to change it to… I have two candidates, the ID identity column is a natural choice. However, we have a process which does a lot of update commands on this table, and it uses the Sequence to do so. The Sequence is non-unique. Most records only contain one, but about 20% can have two or more records with the same Sequence.

The INSERT app is a VB6 piece of crud throwing thousands insert commands at the table. The Inserted values are never in any particular order. So the Sequence of one insert may be 12345, and the next could be 12245. I know that this could cause SQL to move a lot of data to keep the clustered index in order. However, the Sequence of the inserts are generally close to being in order. All inserts would take place at the end of the clustered table. Eg: I have 5 million records with Sequence spanning 1 to 5 million. The INSERT app will be inserting sequence’s at the end of that range at any given time. Reordering of the data should be minimal (tens of thousands of records at most).

Now, the UPDATE app is our .NET star. It does all UPDATES on the Sequence column. “Update Table_Docs Set Feild1=This, Field2=That…WHERE Sequence =12345” – hundreds of thousands of these a day. The UPDATES are completely and totally, random, touching all points of the table.

All other processes are simply doing SELECT’s on this (Web pages). Regular indexes cover those.

So my question is, what’s better….a unique clustered index on the ID column, benefiting the INSERT app, or a non-unique clustered index on the Sequence, benefiting the UPDATE app?

+3  A: 

First off, I would definitely recommend to have a clustered index!

Secondly, your clustered index should be:

  • narrow
  • static (never or hardly ever change)
  • unique
  • ever-increasing

so an INT IDENTITY is a very well thought out choice.

When your clustering key is not unique, SQL Server will add a 4-byte uniqueifier to those column values - thus making your clustering key and with it all non-clustered indices on that table larger and less optimal.

So in your case, I would pick the ID - it's narrow, static, unique and ever-increasing - can't be more optimal than that! Since the Sequence is used heavily in UPDATE statements, definitely put a non-clustered index on it, too!

See Kimberly Tripp's excellent blog posts on choosing the right clustering key for great background info on the topic.

marc_s
Good answer. I saw the banner saying an answer was added as I was hitting submit.
TimothyAWiseman
Thanks. I'm going to read that link this weekend. I want to make an informed decision. I should state that my "Sequence" clustered index would be:•narrow (It's an int)•static (never changes)•almost unique (very limited duplicates 10-20% of all records at most and limited to under 5 rows for each duplicate)•ever-increasingIs this an exception to the rule, considering what we are doing?
Bremer
@Bremer: if your clustering key is not ever-increasing, you'll have to deal with page splits when you insert a new row into the middle of a full page --> not so good for performance.
marc_s
After reading all the great suggestions here, I went with the ID column. It would seem that overall, reducing page splits and database fragmentation would be the best result for the system as a whole, even though one app using the Sequence column might suffer slightly. Thanks everyone for the excellent information.
Bremer
+2  A: 

As a general rule, you want your clustered index to be unique. If it is not, SQL Server will in fact add a hidden "uniquifier" to it to force it to be unique, and this adds overhead.

So, you are probably best using the ID column as your index.

Just as a side note, using a identity column as your primary key is normally referred to as a surrogate key since it is not inherent in your data. When you have a unique natural key available that is probably a better choice. In this case it looks like you do not, so using the unique surrogate key makes sense.

TimothyAWiseman
I know that this is the general recommendation, but I see a unique case here. Is this scenario possibly one of the onces where the "general rule" does not apply?
Bremer
I do not see why this case is unique. As to whether the general rule applies, from what has been described so far, I would say yes. Really determining this in detail would require extensive testing with your exact application, all of it. But all details so far would indicate that the ID column is the way to go.
TimothyAWiseman
+1  A: 

The worst thing about the inserts out of order is page splits.

When SQL Server needs to insert a new record into an existing index page and finds no place there, it takes half the records from the page and moves them into a new one.

Say, you have these records filling the whole page:

1 2 3 4 5 6 7 8 9

and need to insert a 10. In this case, SQL Server will just start the new page.

However, if you have this:

1 2 3 4 5 6 7 8 11

, 10 should go before 11. In this case, SQL Server will move records from 6 to 11 into the new page:

6 7 8 9 10 11

The old page, as it can be easily seen, will remain half filled (only records from 1 to 6 will go there which are very).

This will increase the index size.

Let's create two sample tables:

CREATE TABLE perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))
CREATE TABLE almost_perfect (id INT NOT NULL PRIMARY KEY, stuffing VARCHAR(300))

;
WITH    q(num) AS
        (
        SELECT  1
        UNION ALL
        SELECT  num + 1
        FROM    q
        WHERE   num < 200000
        )
INSERT
INTO    perfect
SELECT  num, REPLICATE('*', 300)
FROM    q
OPTION (MAXRECURSION 0)

;
WITH    q(num) AS
        (
        SELECT  1
        UNION ALL
        SELECT  num + 1
        FROM    q
        WHERE   num < 200000
        )
INSERT
INTO    almost_perfect
SELECT  num + CASE num % 5 WHEN 0 THEN 2 WHEN 1 THEN 0 ELSE 1 END, REPLICATE('*', 300)
FROM    q
OPTION (MAXRECURSION 0)

EXEC sp_spaceused N'perfect'
EXEC sp_spaceused N'almost_perfect'

perfect         200000   66960 KB    66672 KB    264 KB  24 KB
almost_perfect  200000   128528 KB   128000 KB   496 KB  32 KB

Even with only 20% probability of the records being out of order, the table becomes twice as large.

On the other hand, having a clustered key on Sequence will reduce the I/O twice (since it can be done with a single clustered index seek rather than two unclustered ones).

So I'd take a sample subset of your data, insert it into the test table with a clustered index on Sequence and measure the resulting table size.

If it less than twice the size of the same table with an index on ID, I'd go for the clustered index on Sequence (since the total resulting I/O will be less).

If you decide to create a clustered index on Sequence, make ID an unclustered PRIMARY KEY and make the clustered index UNIQUE on Sequence, ID. This will use a meaningful ID instead of opaque uniquiefier.

Quassnoi