views:

164

answers:

4

I have a table in SQL Server, where i want inserts to be added to the end of the table (as opposed to a clustering key that would cause them to be inserted in the middle). This means I want the table clustered by some column that will constantly increase.

This could be achieved by clustering on a datetime column:

CREATE TABLE Things (
    ...
    CreatedDate datetime DEFAULT getdate(),
    [timestamp] timestamp,        

    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (CreatedDate)
)

But I can't guaranteed that two Things won't have the same time. So my requirements can't really be achieved by a datetime column.

I could add a dummy identity int column, and cluster on that:

CREATE TABLE Things (
    ...
    RowID int IDENTITY(1,1),
    [timestamp] timestamp,        

    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (RowID)
)

But you'll notice that my table already constains a timestamp column; a column which is guaranteed to be a monotonically increasing. This is exactly the characteristic I want for a candidate cluster key.

So I cluster the table on the rowversion (aka timestamp) column:

CREATE TABLE Things (
    ...
    [timestamp] timestamp,        

    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (timestamp)
)

Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.

What I'm looking for are thoughts of why this is a bad idea; and what other ideas are better.

Note: Community wiki, since the answers are subjective.

A: 

You were on the right track already. You can use a DateTime column that holds the created date and create a CLUSTERED but non unique constraint.

CREATE TABLE Things (
    ...
    CreatedDate datetime DEFAULT getdate(),
    [timestamp] timestamp,        
)

CREATE CLUSTERED INDEX [IX_CreatedDate] ON .[Things] 
(
    [CreatedDate] ASC
)
ntziolis
A: 

If this table gets a lot of inserts, you might be creating a hot spot that interferes with updates, because all of the inserts will be happening on the same physical/index pages. Check your locking setup.

DaveE
hotspots really aren't a big issue anymore, ever since SQL Server 7.0
marc_s
Granted, it's not a big concern, but if he wants to force all of his inserts into one small part of the table it's something to be aware of.
DaveE
The real point is that i don't want to cluster by a newid() `uniqueidentifier`. i don't want to cluster by other values for the same reason.
Ian Boyd
+2  A: 

from the link: timestamp in the question:

The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.

and

Duplicate rowversion values can be generated by using the SELECT INTO statement in which a rowversion column is in the SELECT list. We do not recommend using rowversion in this manner.

so why on earth would you want to cluster by either, especially since their values alwsys change when the row is updated? just use an identity as the PK and cluster on it.

KM
`rowversion` is the new name for `timestamp`. i specifically decided to use the term `timestamp` since more people will know what i'm talking about. But the two are identical. And duplicate `rowversion`s can only be inserted if you're inserting duplicate `rowversion`s. Which in reality nobody ever does. In the end your final throw away statement is a valid reason: they change on updates, which would shuffle rows around.
Ian Boyd
+3  A: 

So I cluster the table on the rowversion (aka timestamp) column: Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.

That might sound like a good idea at first - but it's really almost the worst option you have. Why?

The main requirements for a clustered key are (see Kim Tripp's blog post for more excellent details):

  • stable
  • narrow
  • unique
  • ever-increasing if possible

Your rowversion violates the stable requirement, and that's probably the most important one. The rowversion of a row changes with each modification to the row - and since your clustering key is being added to each and every non-clustered index in the table, your server will be constantly updating loads of non-clustered indices and wasting a lot of time doing so.

In the end, adding a dummy identity column probably is a much better alternative for your case. The second best choice would be the datetime column - but here, you do run the risk of SQL Server having to add "uniqueifiers" to your entries when duplicates occur - and with a 3.33ms accuracy, this could definitely be happening - not optimal, but definitely much better than the rowversion idea...

marc_s
+1,000,000,000,000
KM
+1 for the answer. And your answer should have another +1 for the opening sentence; it's really perfectly worded to draw in the reader. And it's tactful and polite.
Ian Boyd
@Ian Boyd: thank you for your very kind words!
marc_s