tags:

views:

44

answers:

2

I have a database table that stores the file path of a file and a revision number.

  • Each file has a revision number associated with it.
  • Each file has about 10 revisions at any given time.
  • Every day a new revision gets created for every file
  • Every day the oldest revision gets removed for every file.
  • There are about 100 Million files

Please trust that the above needs to be this way, this is a scaled down example of my real problem.

Would a good clustered index for this table be 'revision number' since I am always removing all revisions that are close to each other? And then I'm also adding all new revisions the same revision each day for each file.

+2  A: 

According to Kimberly L. Tripp's Blog, a clustered index should be:

  • unique
  • narrow
  • static
  • ever increasing

So, let's evaluate your proposed 'revision number' in light of these criteria.

  • unique - This will depend on your answer to Ed Harper's comment. If it's not unique on it's own, it seems like the combination of revision number + file would be.
  • narrow - Assuming revision is something like an integer, you're OK here. If you need to go to revision number + file for uniqueness, and if the file has an ID that's another integer, you're still OK.
  • static - It sounds like a revision is never changed once it is created, so you're good here.
  • Ever increasing - I'm reading between the lines a little bit now, but I think your new revisions are probably created in this way.

In conclusion, depending on the uniqueness of revision numbers, it seems like either revision number or revision number + file ID would be a good choice for the clustered index.

Joe Stefanelli
Please see my reply above. It is not unique and the revision number is an int.
MSSQL
Then, assuming you have a file ID (integer) to work with, I'd cluster on Revision Number + File ID.
Joe Stefanelli
A: 

On top of Joe Stefanelli's answer, I'd add:

  • How is the table being used?
  • Is it merely a dump or log of activity, is it used for OLTP purposes (lookup a few rows at a time), is it used for OLAP-like activity (read many, many rows at a time)?
  • Is performanace critical (must retrieve rows in microsecond time) or secondary (say for end-of day reports)?

Since you only get one clustered index, I'd tailor the clustered index based on these answers so that is best supports system requirements. Some ideas:

If it's a daily log that is rarely if ever queried, a clustered index on only RevisionNumber would be adequate.

If you report on all files loaded on a given day, the clustered index on RevisionNumber would be ideal.

If you need to look up individual files with any kind of frequency or expediency, that index would suck since, if I've got it right, there'd be 100,000,000 rows (files) for each RevisionNumber -- but a simple non-clustered index on FileName, or FileName + RevisionNumber, would cover that (but see the next idea).

For fast lookups, FileName, FilePath, or FilePath + FileName could be painfully long strings to index. Adding a column (or a persisted calculated column) for checksum(FileWhatever) and indexing on that could save a lot of time. Queries would have to look something like:

SELECT FullFileName, Plus, Other, Columns
 from FileTable
 where RevisionNumber = @TargetRevision
  and ChecksumColumn = checksum(@TargetFullFileName)
  and FullFileName = @TargetFullFileName

Lastly, if you're really adding and dropping about 100,000,000 rows every day, I'd serious look into table partitioning, with the partition based on RevisionNumber.

Philip Kelley