I'm not a DBA ("Good!", you'll be thinking in a moment).

I have a table of logging data with these characteristics and usage patterns:

  • A datetime column for storing log timestamps whose value is ever-increasing and mostly (but only mostly) unique
  • Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
  • Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
  • No updates at all
  • Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
  • Infrequent selects using other columns as the criteria (and not including the timestamp column)
  • A good amount of data, but nowhere near enough that I'm worried much about storage space

Additionally, there is currently a daily maintenance window during which I could do table optimization.

I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.

I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).

But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.

All of this suggests to me that I should (a rough DDL sketch follows the list):

  • Put a clustered index on the timestamp column with a 100% fill-factor
  • Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
  • Schedule the bulk deletes to occur during the daily maintenance interval
  • Schedule a rebuild of the clustered index to occur immediately after the bulk delete
  • Relax and get out more
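
In DDL terms, I'm picturing something roughly like this (table and column names are made up purely for illustration):

```sql
-- Hypothetical table, just to illustrate the plan above
CREATE TABLE dbo.LogEntries
(
    LogTimestamp datetime     NOT NULL,
    Source       varchar(50)  NOT NULL,
    Message      varchar(400) NOT NULL
);

-- Clustered index on the timestamp, 100% fill factor
CREATE CLUSTERED INDEX IX_LogEntries_LogTimestamp
    ON dbo.LogEntries (LogTimestamp)
    WITH (FILLFACTOR = 100);

-- Non-clustered index for the occasional queries that don't use the timestamp
CREATE NONCLUSTERED INDEX IX_LogEntries_Source
    ON dbo.LogEntries (Source);

-- During the maintenance window, immediately after the bulk delete
ALTER INDEX IX_LogEntries_LogTimestamp ON dbo.LogEntries REBUILD;
```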

Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?

Thanks in advance.

+2  A: 

There are two "best practice" ways to index a high-traffic logging table:

  1. an integer identity column as the clustered primary key
  2. a uniqueidentifier column as the primary key, with DEFAULT NEWSEQUENTIALID()

Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
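
Roughly, the two options would look something like this (table and column names are made up for illustration, not taken from the question):

```sql
-- Option 1: integer IDENTITY column as the clustered primary key
CREATE TABLE dbo.LogA
(
    LogId        int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    LogTimestamp datetime     NOT NULL,
    Message      varchar(400) NOT NULL
);

-- Option 2: uniqueidentifier primary key with a NEWSEQUENTIALID() default
CREATE TABLE dbo.LogB
(
    LogId        uniqueidentifier NOT NULL
                 CONSTRAINT DF_LogB_LogId DEFAULT NEWSEQUENTIALID()
                 CONSTRAINT PK_LogB PRIMARY KEY CLUSTERED,
    LogTimestamp datetime     NOT NULL,
    Message      varchar(400) NOT NULL
);
```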

I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.

Andomar
And note the NewSequentialID() default. Unique identifiers that are not sequential will kill performance on tables with heavy insert traffic. In any case, unless you have a specific need for unique identifiers, go with an integer identity.
Cylon Cat
Interesting, thanks. I'm going to be querying this table on the basis of that timestamp fairly frequently, and *never* by any kind of unique integer key. Since the timestamp is monotonic (though not perfectly unique), why is that not a better choice for the clustered index? I mean, I expect the timestamp to be unique at least 95% of the time; not like logging web hits where it could easily be a lot lower than that. (+1, btw, for the "don't optimize prematurely" reminder.)
T.J. Crowder
SQL Server doesn't know that only later dates can be entered in a timestamp column; you might enter a date in the past. That makes it harder to keep the index filled optimally.
Andomar
Seems like another case where adding an IDENTITY column is a knee-jerk reaction and not a real design decision. If you are not going to be looking back at specific, individual rows in the log because you need to identify them by ID, and your queries will always be on date ranges, then ditch the identity column and put the clustered index on the datetime column. IMHO.
Aaron Bertrand
@Andomar, I think that is only going to be an issue if you really do intend to backfill data, which is very uncommon indeed for a real-time logging table.
Aaron Bertrand
@Andomar: But does it care? **I** know the timestamp is monotonic, so I know I'm only ever going to be asking it to add to the last page, hence the fill factor. (Just tell me if I'm suffering from the danger of a little knowledge, it's been known.)
T.J. Crowder
An `identity` or `newsequentialid()` clustered index allows SQL Server to completely fill the data and index pages; see http://msdn.microsoft.com/en-us/library/ms189786.aspx
Andomar
But what possible purpose could such a column serve? Any table without such columns could certainly be designed to fill data and index pages as well. This doesn't seem to be a valid reason, at least in isolation, for putting an IDENTITY or NEWSEQUENTIALID() column on a table.
Aaron Bertrand
+1  A: 

I agree with putting the clustered index on the timestamp column. My only query would be about the fill factor: 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fill factor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best fit for your situation.

After the bulk deletes, it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fill factor.
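
Something along these lines during the maintenance window (table name is a placeholder; a sketch only):

```sql
-- After the bulk delete, rebuild back to the chosen fill factor
ALTER INDEX ALL ON dbo.LogEntries
    REBUILD WITH (FILLFACTOR = 100);

-- Rebuilding refreshes statistics on the rebuilt indexes; this catches the rest
UPDATE STATISTICS dbo.LogEntries;
```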

Finally, yes, put non-clustered indexes on other appropriate columns, but only ones that are very selective (e.g. not bit fields). But remember: the more indexes there are, the more write performance is affected.

AdaTheDev
Thanks. My thinking on the fill factor is that I'm only ever going to be inserting either the same value as my last insert (rarely) or a higher value, never *inserting* into the sequence. So leaving room for adding records to existing index pages seems unnecessary -- to my incomplete understanding of things. :-)
T.J. Crowder
Yep that is a good point. I was thinking a very high fillfactor anyway. I think you're on the money with your research. But still monitor it, do some tests. Btw, you can monitor page splits in perfmon.
AdaTheDev
A: 

The obvious answer is that it depends on how you will query it. The point of an index is to reduce the number of comparisons made when selecting data. The clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64 KB block with one read). If you include an ID and a datetime as the primary key but don't use them in your selection criteria, they will do nothing but hinder your performance. This is why people usually drop indexes before bulk-loading data.
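
As a rough sketch of that last point (names are placeholders; disabling and rebuilding has the same effect as dropping and recreating, without losing the index definition):

```sql
-- Take the non-clustered index offline before a large bulk load
ALTER INDEX IX_LogEntries_Source ON dbo.LogEntries DISABLE;

-- ... perform the bulk insert here ...

-- Rebuilding a disabled index brings it back online
ALTER INDEX IX_LogEntries_Source ON dbo.LogEntries REBUILD;
```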

Dr. Zim
Thanks. I think I did indicate how I was going to be querying it: A lot on the basis of the timestamp column, occasionally on other columns not in combination with the timestamp.
T.J. Crowder
There is a combo that I am not getting across well (my bad): it is better to include all your search fields in one composite index, and even the order of the fields matters (rough sketch below). That's obviously the point of your question, but there is a trick to matching your search fields to the fields that go into your index.
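
Something like this (column names made up for illustration):

```sql
-- Field order matters: this supports queries that filter on LogTimestamp
-- (alone or together with Severity), but a query filtering on Severity alone
-- can't seek on it and would need its own index (or Severity listed first).
CREATE NONCLUSTERED INDEX IX_LogEntries_Timestamp_Severity
    ON dbo.LogEntries (LogTimestamp, Severity);
```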
Dr. Zim
Thanks. I think you're talking about situations where I can satisfy the query direct from a compound index without having to then seek to the "real" data. Valid point, but not necessary in this case; the kinds of queries we'll be doing will always take us to the main data.
T.J. Crowder
+2  A: 

Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!

Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.

She mentions (about halfway through the article):

Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.

The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and optimally ever-increasing. This is best served with an INT IDENTITY column.

Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!

If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
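
For a table like the one in the question, that would look roughly like this (names made up for illustration; the non-clustered index is there to support the timestamp range queries):

```sql
-- Narrow, ever-increasing surrogate key as the clustering key
CREATE TABLE dbo.LogEntries
(
    LogId        int IDENTITY(1,1) NOT NULL
                 CONSTRAINT PK_LogEntries PRIMARY KEY CLUSTERED,
    LogTimestamp datetime     NOT NULL,
    Message      varchar(400) NOT NULL
);

-- Separate non-clustered index for queries on the timestamp
CREATE NONCLUSTERED INDEX IX_LogEntries_LogTimestamp
    ON dbo.LogEntries (LogTimestamp);
```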

marc_s
Very valuable reading, thanks!
T.J. Crowder