I have the following lookup table in OLTP:

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    TransactionStateName VarChar (100)
)

When this comes into my OLAP, I change the structure as follows:

CREATE TABLE TransactionState
(
    TransactionStateId INT NOT NULL, /* not an IDENTITY column in OLAP */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime DateTime NULL
)

My question is regarding the TransactionStateId column. Over time, I may have duplicate TransactionStateId values in my OLAP, but with the combination of StartDateTime and EndDateTime, they would be unique.

I have seen samples of Type-2 Dimensions where an OriginalTransactionStateId is added and the incoming TransactionStateId is mapped to it, plus a new TransactionStateId IDENTITY field becomes the PK and is used for the joins.

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL,
    OriginalTransactionStateId INT NOT NULL, /* holds the incoming OLTP TransactionStateId */
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime DateTime NULL
)

Should I go with bachelorette #2 or bachelorette #3?

+2  A: 

By this phrase:

With the combination of StartDateTime and EndDateTime, they would be unique.

do you mean that they never overlap, or that they satisfy a database UNIQUE constraint?

If the former, then you can use the StartDateTime in joins, but note that it may be inefficient, since it will use a "<=" condition instead of "=".

If the latter, then just use a fake identity.

Databases in general do not provide an efficient algorithm for this query:

SELECT  *
FROM    TransactionState
WHERE   @value BETWEEN StartDateTime AND EndDateTime

, unless you do arcane tricks with SPATIAL data.

That's why you'll have to use this condition in a JOIN:

SELECT  *
FROM    factTable
CROSS APPLY
        (
        SELECT  TOP 1 *
        FROM    TransactionState
        WHERE   StartDateTime <= factDateTime
        ORDER BY
                StartDateTime DESC
        ) ts

, which deprives the optimizer of the ability to use a HASH JOIN, often the most efficient join for such queries.

See this article for more details on this approach:

Rewriting the query so that it can use a HASH JOIN resulted in a 600% performance gain, though this is only possible if your datetimes have an accuracy of a day or coarser (otherwise the hash table will grow very large).

Since the time component is stripped off your StartDateTime and EndDateTime, you can create a CTE like this:

WITH    cal AS
        (
        SELECT CAST('2009-01-01' AS DATE) AS cdate
        UNION ALL
        SELECT DATEADD(day, 1, cdate)
        FROM   cal
        WHERE  cdate <= '2009-03-01'
        ),
        state AS
        (
        SELECT  cdate, ts.*
        FROM    cal
        CROSS APPLY
                (
                SELECT  TOP 1 *
                FROM    TransactionState
                WHERE   StartDateTime <= cdate
                ORDER BY
                        StartDateTime DESC
                ) ts
        WHERE   ts.EndDateTime >= cdate
        )
SELECT  *
FROM    factTable
JOIN    state
ON      cdate = CAST(factDate AS DATE)

If your date ranges span more than 100 dates, adjust the MAXRECURSION option on the CTE.
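For example, a sketch of lifting the limit on the final query (0 disables the recursion cap entirely):

SELECT  *
FROM    factTable
JOIN    state
ON      cdate = CAST(factDate AS DATE)
OPTION  (MAXRECURSION 0) -- default limit is 100 recursion levels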

Quassnoi
@Quassnoi: The time component is stripped off for both StartDateTime and EndDateTime - they will be a day apart.
Raj More
@Quassnoi: Thank you for the code sample. What are your views on my original question about the two bachelorettes?
Raj More
@Tapori: if by "bachelorettes" you mean `SCD` types, then `Type 2` is better if you are going to have more than `2` versions of each state. It is generally considered slower to query (for the reasons I described in the post), but using the technique from the post you can get decent performance.
Quassnoi
@Tapori: `Type 2` is the design using `StartDate` and `EndDate`. The one with `OriginalTransactionStateId` is `Type 3`.
Quassnoi
+1  A: 

Please be aware that IDENTITY(1,1) is a declaration that auto-generates values in that column. This is different from PRIMARY KEY, which is a declaration that makes a column the primary key, backed by a clustered index by default. These two declarations mean different things, and there are performance implications if you don't say PRIMARY KEY.
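For illustration, a sketch of the question's table carrying both declarations (the constraint name is hypothetical; in SQL Server, PRIMARY KEY creates a clustered index by default):

CREATE TABLE TransactionState
(
    TransactionStateId INT IDENTITY (1, 1) NOT NULL
        CONSTRAINT PK_TransactionState PRIMARY KEY, /* hypothetical constraint name */
    OriginalTransactionStateId INT NOT NULL,
    TransactionStateName VarChar (100) NOT NULL,
    StartDateTime DateTime NOT NULL,
    EndDateTime DateTime NULL
)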

David B
@David B: Beg pardon, I did not post the entire DDL for the table; the code posted is OTTOMH (off the top of my head). Since my question was regarding data warehousing design, I focused on that part.
Raj More
+1  A: 

You could also use SSIS to load the DW. In the slowly changing dimension (SCD) transformation, you can set how each attribute is treated. If a historical attribute is selected, type 2 SCD is applied to the whole row, and the transformation takes care of the details. You also get to configure whether you prefer start_date/end_date columns or a current/expired flag column.

The thing to differentiate here is the difference between the primary key and the business (natural) key. The primary key uniquely identifies a row in the table. The business key uniquely identifies a business object/entity and can be repeated in a dimension table. Each time SCD 2 is applied, a new row is inserted with a new primary key but the same business key; the old row is then marked as expired while the new one is marked as current -- or the start date and end date fields are populated appropriately.
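A minimal sketch of that pattern in plain T-SQL, roughly what the transformation does under the hood (@BusinessKey, @NewName, and @LoadDate are hypothetical variables):

-- Expire the current row for this business key
UPDATE TransactionState
SET    EndDateTime = @LoadDate
WHERE  OriginalTransactionStateId = @BusinessKey
       AND EndDateTime IS NULL;

-- Insert the new version; the IDENTITY primary key is assigned by the DW
INSERT INTO TransactionState
        (OriginalTransactionStateId, TransactionStateName, StartDateTime, EndDateTime)
VALUES  (@BusinessKey, @NewName, @LoadDate, NULL);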

The DW should not expose its primary keys; incoming data from OLTP carries business keys, while assignment of primary keys is under the control of the DW. An IDENTITY int is a good choice for PKs in dimension tables.

The cool thing is that the SCD transformation in SSIS takes care of this.

Damir Sudarevic