Let's assume you have one massive table with three columns as shown below:
[id] INT NOT NULL,
[date] SMALLDATETIME NOT NULL,
[sales] FLOAT NULL
Also assume you are limited to one physical disk and one filegroup (PRIMARY). You expect this table to hold sales for 10,000,000+ ids, across 100's of dates (easily 1B+ records).
As with many data warehousing scenarios, the data will typically grow sequentially by date (i.e., each time you perform a data load, you will be inserting new dates, and maybe updating some of the more recent dates of data). For analytic purposes, the data will often be queried and aggregated for a random set of ~10,000 ids which will be specified via a join with another table. Often, these queries don't specify date ranges, or specify very wide date ranges, which leads me to my question: What is the best way to index / partition this table?
I have thought about this for a while, but am stuck with conflicting solutions:
Option #1: As data will be loaded sequentially by date, define the clustered index (and primary key) as [date], [id]. Also create a "sliding window" partitioning function / scheme on date allowing rapid movement of new data in / out of the table. Potentially create a non-clustered index on id to help with querying.
Expected Outcome #1: This setup will be very fast for data loading purposes, but sub-optimal when it comes to analytic reads as, in a worst case scenario (no limiting by dates, unlucky with set of id's queried), 100% of the data pages may be read.
Option #2: As the data will be queried for only a small subset of ids at a time, define the clustered index (and primary key) as [id], [date]. Do not bother to create a partitioned table.
Expected Outcome #2: Expected huge performance hit when it comes to loading data as we can no longer quickly limit by date. Expected huge performance benefit when it comes to my analytic queries as it will minimize the number of data pages read.
Option #3: Clustered (and primary key) as follows: [id], [date]; "sliding window" partition function / scheme on date.
Expected Outcome #3: Not sure what to expect. Given that the first column in the clustered index is [id] and thus (it is my understanding) the data is arranged by ID, I would expect good performance from my analytic queries. However, the data is partitioned by date, which is contrary to the definition of the clustered index (but still aligned as date is part of the index). I haven't found much documentation that speaks to this scenario and what, if any, performance benefits I may get from this, which brings me to my final, bonus question:
If I am creating a table on one filegroup on one disk, with a clustered index on one column, is there any benefit (besides partition switching when loading the data) that comes from defining a partition on the same column?