In my brand new data warehouse that is built (of course) from the OLTP database, I have dropped all the IDENTITY columns and changed them to INT columns.

What are the best practices regarding the following, especially since the warehouse is denormalized:

  1. Primary Key
    -> This may now be a composite key because several tables have come together.
    -> Do I need to follow the key structure from OLTP?

  2. Constraints
    -> There are some constraints (NOT NULL) with default values (0) for bit columns.
+1  A: 

For your primary key, consider using a surrogate or alternate key; you'll need to cater for slowly changing dimensions, e.g. if you're doing a report over the last 5 years on average sales per married/unmarried salesperson, you'll want to register the fact that someone was unmarried for 2 years, then married for the last 3. This means your data warehouse will have two dimension-table rows for the same person. Following the OLTP structure for that will be hard :)
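
To make that concrete, here is a minimal T-SQL sketch of a type-2 slowly changing dimension (all table and column names are hypothetical, not from the question): the surrogate key identifies a *version* of a person, while the natural key from OLTP can repeat across versions.

    -- Hypothetical type-2 dimension: surrogate key per version of a person.
    CREATE TABLE DimSalesperson (
        SalespersonKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
        SalespersonID   int          NOT NULL,  -- natural key from OLTP
        SalespersonName varchar(100) NOT NULL,
        MaritalStatus   varchar(10)  NOT NULL,  -- 'Unmarried' / 'Married'
        EffectiveFrom   date         NOT NULL,
        EffectiveTo     date         NOT NULL   -- '9999-12-31' for the current row
    );

    -- The same salesperson gets one row per version: unmarried for two
    -- years, then married for the last three.
    INSERT INTO DimSalesperson
        (SalespersonID, SalespersonName, MaritalStatus, EffectiveFrom, EffectiveTo)
    VALUES
        (1001, 'Jane Doe', 'Unmarried', '2004-01-01', '2005-12-31'),
        (1001, 'Jane Doe', 'Married',   '2006-01-01', '9999-12-31');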

Constraints are less of an issue; DWs are massively optimised for reads (assuming you're populating as a batch), and constraints don't really factor into read operations. You can typically get around any constraint issues with your DW populating job, and deal with nulls etc. at the reporting tool if necessary. It's far more important to make sure default values fit with your conceptual data model, and don't introduce issues at the DW client tools.
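
For the rare case where a constraint does interfere with a batch load, one common T-SQL pattern, sketched here with a hypothetical table name, is to disable constraint checking around the load:

    -- Disable check and FK constraint validation for the batch load.
    ALTER TABLE FactSales NOCHECK CONSTRAINT ALL;

    -- ... run the batch load ...

    -- Re-enable with a full re-check so the constraints are trusted again.
    ALTER TABLE FactSales WITH CHECK CHECK CONSTRAINT ALL;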

Jeremy Smyth
@Jeremy - So if my OLTP has a Person table, a MaritalStatus lookup, and a PersonsMaritalStatus table, and I denormalize them, the warehouse would have one Person table with a composite key of PersonId and MaritalStatusId. That would account for a change in marital status like you describe. My questions then are: Do I use the composite key, or create a new column (like I do in OLTP)? And do I even waste my one clustered index on that key, or do I save it for something important?
Raj More
Well, the difficulty is you'd have to do the same thing for Zipcode or PhoneNumber or any other field (or combination of fields) that changes infrequently and that you'd need to report on. That's why most solutions that handle slowly changing dimensions implement an alternate key :)
Jeremy Smyth
A: 

About 2.: bit columns -> work as boolean columns -> only 1/0 (true/false) allowed -> the constraint is OK.
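
In SQL Server terms, for example (table and column names are illustrative):

    -- bit can only hold 0 or 1 anyway, so NOT NULL with a 0 default
    -- is a harmless constraint to keep.
    ALTER TABLE DimCustomer
        ADD IsActive bit NOT NULL
        CONSTRAINT DF_DimCustomer_IsActive DEFAULT (0);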

Richard
+1  A: 

For dimension tables:

  • Keep the surrogate auto-increment (identity) PK, except for date dimension (see below).
  • Make sure you have an alternate "natural key" to allow for slowly changing dimensions (type 2).
  • No nulls allowed in dimension tables; replace them with verbose values like "n/a", "not entered", "unknown".
  • If possible, change Boolean flags (1/0) to verbose "Yes"/"No" values, to make reports business-user friendly.
  • Get rid of calculated fields and replace them with values, or at least persist the calculated field -- this depends on the database.
  • Implement the star schema if you can; trade space for speed. Snowflake only if you have to.
  • Check your queries: if there is a function in the WHERE clause, add a column to the dimension table and pre-calculate the values.
  • It is easy to partition the date dimension if the PK looks like 20090619 (see the sketch after this list).
  • Get rid of check constraints and defaults; move that to the conform phase of the ETL process. Checks and defaults slow down loading, and once loading is finished they play no part.
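
A T-SQL sketch of several of these points together (all names are illustrative; per the last bullet, the verbose values would be written by the conform step of the ETL rather than enforced by defaults):

    -- Hypothetical date dimension: an integer PK of the form yyyymmdd makes
    -- range partitioning straightforward, and pre-calculated columns such
    -- as [Year] avoid functions like YEAR() in WHERE clauses.
    CREATE TABLE DimDate (
        DateKey     int         NOT NULL PRIMARY KEY,  -- e.g. 20090619
        FullDate    date        NOT NULL,
        [Year]      smallint    NOT NULL,
        [MonthName] varchar(10) NOT NULL
    );

    -- Hypothetical customer dimension: no NULLs and no bit flags; the ETL
    -- conform step writes verbose values such as 'Unknown' and 'Yes'/'No'.
    CREATE TABLE DimCustomer (
        CustomerKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate PK
        CustomerID   int          NOT NULL,  -- natural key from OLTP
        Region       varchar(50)  NOT NULL,  -- 'Unknown' rather than NULL
        IsPreferred  varchar(3)   NOT NULL   -- 'Yes' / 'No' rather than 1/0
    );

    -- Queries then filter on a pre-calculated column instead of a function:
    --   WHERE d.[Year] = 2009      -- instead of WHERE YEAR(d.FullDate) = 2009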

For fact tables:

  • Consider having a surrogate auto-increment (identity) PK to allow for easy partitioning; if using a composite PK, you can make it a composite unique non-clustered index instead.

  • Have scripts for your foreign keys in a safe place; it is a common practice to drop keys before loading fact tables in order to speed up loading. Some people run the DW with foreign keys being "logical only"; they use "look-for-orphans" queries after loading (see the sketch below).
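
A sketch of both points in T-SQL (all table, index, and constraint names are hypothetical):

    -- Hypothetical fact table: surrogate identity PK (easy to partition),
    -- with the business grain kept as a unique non-clustered index.
    CREATE TABLE FactSales (
        SalesKey       bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,
        DateKey        int   NOT NULL,
        SalespersonKey int   NOT NULL,
        SalesAmount    money NOT NULL
    );
    CREATE UNIQUE NONCLUSTERED INDEX UQ_FactSales_Grain
        ON FactSales (DateKey, SalespersonKey);

    -- Keep scripts like these in a safe place: drop the key before the
    -- bulk load, re-create it (with full validation) afterwards.
    ALTER TABLE FactSales DROP CONSTRAINT FK_FactSales_DimDate;
    -- ... bulk load ...
    ALTER TABLE FactSales WITH CHECK ADD CONSTRAINT FK_FactSales_DimDate
        FOREIGN KEY (DateKey) REFERENCES DimDate (DateKey);

    -- A "look-for-orphans" query: fact rows with no matching dimension row.
    SELECT DISTINCT f.DateKey
    FROM FactSales AS f
    LEFT JOIN DimDate AS d ON d.DateKey = f.DateKey
    WHERE d.DateKey IS NULL;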

ETL

  • Design your ECCD (ETL) process through all stages: extract, clean, conform, deliver.
  • If possible, keep intermediate results (files) after each stage for audit and debug purposes.
  • Document the ETL, and if using scripts, use version control so you can match script versions with archived (intermediate) files.
  • Have a data lineage chart; Excel is better than nothing. Keep versions too.
Damir Sudarevic
Well written. I wish I could upvote this more than once.
Raj More