Suppose we have a denormalized table with about 80 columns that grows at a rate of ~10 million rows (about 5GB) per month. We currently have 3 1/2 years of data (~400M rows, ~200GB).

To best suit how we retrieve data from the table, we create a clustered index on the following columns, which serve as our primary key...

    [FileDate] ASC, 
    [Region] ASC,
    [KeyValue1] ASC, 
    [KeyValue2] ASC

... because when we query the table, we always have the entire primary key.
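For reference, that primary key might be declared roughly like this (the constraint name is illustrative, and the partitioning clause and column types are omitted):

    -- Sketch only: the constraint name is an assumption; partitioning
    -- by FileDate is not shown here.
    ALTER TABLE dbo.HugeTable
    ADD CONSTRAINT PK_HugeTable
        PRIMARY KEY CLUSTERED
        (
            [FileDate]  ASC,
            [Region]    ASC,
            [KeyValue1] ASC,
            [KeyValue2] ASC
        );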

So these queries always result in clustered index seeks and are therefore very fast, and fragmentation is kept to a minimum. However, we do have a situation where we want to get the most recent FileDate for every Region, typically for reports, i.e.

    SELECT
     [Region]
    , MAX([FileDate]) AS [FileDate]
    FROM
     HugeTable
    GROUP BY
     [Region]

The "best" solution I can come up to this is to create a non-clustered index on Region. Although it means an additional insert on the table during loads, the hit isn't minimal (we load 4 times per day, so fewer than 100,000 additional index inserts per load). Since the table is also partitioned by FileDate, results to our query come back quickly enough (200ms or so), and that result set is cached until the next load.

However, I'm guessing that someone with more data warehousing experience might have a better solution, as this one, for some reason, doesn't "feel right".

+1  A: 

Another option would be to have another table (Region, FileDate) which holds the most recent FileDate for each Region. You would update this table during your load.
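As a rough sketch of what that could look like (the table name RegionLatestFileDate, the staging table StagingRows, and the column types are all illustrative assumptions, not names from the question):

    -- Sketch only: names and types are assumptions.
    CREATE TABLE dbo.RegionLatestFileDate
    (
        [Region]   int      NOT NULL PRIMARY KEY,  -- type assumed
        [FileDate] datetime NOT NULL               -- type assumed
    );

    -- After each load, fold the newly loaded batch into the summary table.
    MERGE dbo.RegionLatestFileDate AS t
    USING
    (
        SELECT [Region], MAX([FileDate]) AS [FileDate]
        FROM dbo.StagingRows            -- the batch just loaded (assumed name)
        GROUP BY [Region]
    ) AS s
        ON t.[Region] = s.[Region]
    WHEN MATCHED AND s.[FileDate] > t.[FileDate]
        THEN UPDATE SET t.[FileDate] = s.[FileDate]
    WHEN NOT MATCHED
        THEN INSERT ([Region], [FileDate]) VALUES (s.[Region], s.[FileDate]);

The report query then reads the small summary table instead of scanning HugeTable.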

AdamRalph
Another index on (Region, FileDate) would be simpler, no?
gbn
Yep, definitely simpler. I was thinking of a) data volume and b) update time. To get the latter benefit you would need to know Region, MAX(FileDate) independently of the data being loaded (otherwise you'd end up doing a conditional check on each insert anyway). If data volume is not an issue and/or the number of updates can't be reduced as I've described, then the additional covering index would be the best way.
AdamRalph
+1  A: 

I'd create the covering index (nonclustered) on (Region, FileDate), not just Region. However, it will be large because you have a wide clustered key.
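Something along these lines (the index name is just an example):

    -- Sketch only: the index name is an assumption.
    CREATE NONCLUSTERED INDEX IX_HugeTable_Region_FileDate
        ON dbo.HugeTable ([Region], [FileDate]);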

Otherwise, try AdamRalph's idea, but I think that adds overhead that outweighs the cost of another index.

gbn
That was a typo in my original question. Indeed, I couldn't think of anything better than an additional non-clustered index. Everything else involves too much pre-processing or post-processing to make it worthwhile.
The Lazy DBA
A: 

Any chance you could build a cube in Analysis Services, and run your aggregation query against the cube?

The queries should be faster, although there would be a delay from when your data changes until when the cube finishes updating.

RickNZ