views:

164

answers:

1

Suppose you have a table with a Type 2 slowly-changing dimension.

Let's express this table as follows, with the following columns:

* [Key]
* [Value1]
* ...
* [ValueN]
* [StartDate]
* [ExpiryDate]

In this example, let's suppose that [StartDate] is effectively the date in which the values for a given [Key] become known to the system. So our primary key would be composed of both [StartDate] and [Key].

When a new set of values arrives for a given [Key], we assign [ExpiryDate] to some pre-defined high surrogate value such as '12/31/9999'. We then set the existing "most recent" records for that [Key] to have an [ExpiryDate] that is equal to the [StartDate] of the new value. A simple update based on a join.


So if we always wanted to get the most recent records for a given [Key], we know we could create a clustered index that is:

* [ExpiryDate] ASC
* [Key] ASC

Although the keyspace may be very wide (say, a million keys), we can minimize the number of pages between reads by initially ordering them by [ExpiryDate]. And since we know the most recent record for a given key will always have an [ExpiryDate] of '12/31/9999', we can use that to our advantage.

However... what if we want to get a point-in-time snapshot of all [Key]s at a given time? Theoretically, the entirety of the keyspace isn't all being updated at the same time. Therefore for a given point-in-time, the window between [StartDate] and [ExpiryDate] is variable, so ordering by either [StartDate] or [ExpiryDate] would never yield a result in which all the records you're looking for are contiguous. Granted, you can immediately throw out all records in which the [StartDate] is greater than your defined point-in-time.


In essence, in a typical RDBMS, what indexing strategy affords the best way to minimize the number of reads to retrieve the values for all keys for a given point-in-time? I realize I can at least maximize IO by partitioning the table by [Key], however this certainly isn't ideal.

Alternatively, is there a different type of slowly-changing-dimension that solves this problem in a more performant manner?

+1  A: 

Lazy DBA

Are you talking about bringing back all the values in your dimension table? If so, then why not add a non-clustered index with additional coverage such that you're only pulling values out of the index itself, rather than from the table? That way you're scanning a B-Tree with some attached "covered" values, as opposed to potentially performing a table scan? I can't vouch for relative performance, but it's worth testing for the scenario you're obviously working on.

Cheers

Ozziemedes http://ozziemedes.blogspot.com/

Ozziemedes