views: 1346
answers: 6

What are the common design approaches taken in loading data from a typical Entity-Relationship OLTP database model into a Kimball star schema Data Warehouse/Marts model?

  • Do you use a staging area to perform the transformation and then load into the warehouse?
  • How do you link data between the warehouse and the OLTP database?
  • Where/How do you manage the transformation process - in the database as sprocs, DTS/SSIS packages, or SQL from application code?
+5  A: 

Personally, I tend to work as follows:

  1. Design the data warehouse first. In particular, design the tables that are needed as part of the DW, ignoring any staging tables.
  2. Design the ETL, using SSIS, but sometimes with SSIS calling stored procedures in the involved databases.
  3. If any staging tables are required as part of the ETL, fine, but at the same time make sure they get cleaned up. A staging table used only as part of a single series of ETL steps should be truncated after those steps are completed, with or without success.
  4. I have the SSIS packages refer to the OLTP database at least to pull data into the staging tables. Depending on the situation, they may process the OLTP tables directly into the data warehouse. All such queries are performed WITH (NOLOCK) (see the sketch after this list).
  5. Document, Document, Document. Make it clear what inputs are used by each package, and where the output goes. Make sure to document the criteria by which the inputs are selected (last 24 hours? since last success? new identity values? all rows?)
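
As a rough illustration of steps 3 to 5 (all object names here are invented for the example), the staging extract and its clean-up might look something like this:

    -- Hypothetical staging extract for a "last 24 hours" load window.
    TRUNCATE TABLE Staging.OrdersExtract;                       -- start from a clean slate

    INSERT INTO Staging.OrdersExtract (OrderID, CustomerID, OrderDate, Amount)
    SELECT o.OrderID, o.CustomerID, o.OrderDate, o.Amount
    FROM   OLTP.dbo.Orders AS o WITH (NOLOCK)                   -- dirty read of the OLTP table, as in step 4
    WHERE  o.OrderDate >= DATEADD(DAY, -1, GETDATE());          -- documented selection criterion (step 5)

    -- ... dimension and fact loads consume the staging table here ...

    TRUNCATE TABLE Staging.OrdersExtract;                       -- clean up whether the run succeeded or not (step 3)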

This has worked well for me, though I admit I haven't done many of these projects, nor any really large ones.

John Saunders
@John - did you use the Kimble "fact and dimension star schema" design for your data warehouse model?
Russ Cam
I think so. I've never read this "Kimble" guy, though people invoke his name in data warehousing almost as much as people invoke "Knuth" in algorithms. But then again, I never _did_ finish that first Knuth book, and wound up selling the entire set. The datamart I'm working on now is more of a snowflake, as we have a few dimensions that have dimensions. Our situation is analogous to having both customers and salesmen being dimensions, and they both have geography.
John Saunders
Apologies, I meant Kimball, not Kimble :)
Russ Cam
Haven't read him, either.
John Saunders
+1  A: 

John Saunders' process explanation is a good one.

If you are looking to implement a data warehouse project in SQL Server, you will find all the information you require for delivering the entire project within the excellent text "The Microsoft Data Warehouse Toolkit".

Funnily enough, one of the authors is Ralph Kimball :-)

John Sansom
Apologies, I meant Kimball, not Kimble :) I have borrowed the book from a colleague, but wanted to get a heads-up on the common strategies and tools used for performing the data transformation.
Russ Cam
SSIS is the way to go. The majority of tasks you would need to perform are already catered for by the existing package components, and the parallel processing capability can really get things moving quickly in terms of throughput.
John Sansom
+1  A: 

I'm currently working on a small/mid-size data warehouse. We're adopting some of the concepts that Kimball puts forward, i.e. the star schema with fact and dimension tables. We structure it so that facts only join to dimensions (not fact to fact or dimension to dimension - but this is our choice, not saying it's the way it should be done), so we flatten all dimension joins onto the fact table.
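
To make that concrete, here is an illustrative T-SQL sketch (all table and column names are invented): the fact table carries its own surrogate key for every dimension it needs, including geography, rather than reaching geography through another dimension.

    -- Illustrative flattened star schema (hypothetical names).
    CREATE TABLE dbo.DimDate      (DateKey      INT PRIMARY KEY, CalendarDate DATE NOT NULL);
    CREATE TABLE dbo.DimCustomer  (CustomerKey  INT IDENTITY PRIMARY KEY, CustomerBK NVARCHAR(20) NOT NULL, CustomerName NVARCHAR(100) NOT NULL);
    CREATE TABLE dbo.DimGeography (GeographyKey INT IDENTITY PRIMARY KEY, City NVARCHAR(50), Country NVARCHAR(50));

    CREATE TABLE dbo.FactSales
    (
        DateKey      INT   NOT NULL REFERENCES dbo.DimDate (DateKey),
        CustomerKey  INT   NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
        GeographyKey INT   NOT NULL REFERENCES dbo.DimGeography (GeographyKey),  -- joined directly to the fact, not via DimCustomer
        SalesAmount  MONEY NOT NULL
    );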

We use SSIS to move the data from the production DB -> source DB -> staging DB -> reporting DB (we probably could have used fewer DBs, but that's the way it's fallen).

SSIS is really nice as it lets you structure your data flows very logically. We use a combination of SSIS components and stored procs, where one nice feature of SSIS is the ability to provide SQL commands as a transform between a source/destination data-flow. This means we can call stored procs on every row if we want, which can be useful (albeit a bit slower).
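
For example, the SQL command in such a transform can be a parameterised stored procedure call. A hypothetical sketch (the procedure name, table and columns are made up) of the kind of per-row procedure an OLE DB Command transform might call:

    -- Hypothetical per-row cleansing proc; in the data flow it would be invoked as
    --   EXEC dbo.usp_CleanseCustomer ?, ?
    -- with the ? placeholders mapped to columns of each row passing through.
    CREATE PROCEDURE dbo.usp_CleanseCustomer
        @CustomerBK NVARCHAR(20),
        @Email      NVARCHAR(200)
    AS
    BEGIN
        SET NOCOUNT ON;
        -- trim and normalise the e-mail address for the staged row
        UPDATE Staging.Customer
        SET    Email = LOWER(LTRIM(RTRIM(@Email)))
        WHERE  CustomerBK = @CustomerBK;
    END;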

We're also using a new SQL Server 2008 feature called Change Data Capture (CDC), which audits all changes on a table (you can specify which columns you want to track). We use it on the production DB to tell what has changed, so we can move just those records across to the source DB for processing.
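
For illustration, a rough sketch of how CDC might be enabled and queried (the table and column names are hypothetical):

    -- Enable CDC on the database and on a hypothetical dbo.Orders table,
    -- tracking only the columns of interest.
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema        = N'dbo',
        @source_name          = N'Orders',
        @role_name            = NULL,
        @captured_column_list = N'OrderID, CustomerID, OrderDate, Amount';

    -- Later, the ETL pulls just the rows that changed in the load window.
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');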

JonoW
A: 

As a related question, are there any good tools that give a non-expert a rapid way to start developing data warehouses from OLTP systems?

From my research, there are two main proponents of data warehouse design, Ralph Kimball and Bill Inmon. Getting a book by either of these is a great start - perhaps Inmon's "Building the Data Warehouse".
Russ Cam
I'd suggest the Kimball approach myself as he is more readable - perhaps "The Data Warehouse Lifecycle Toolkit". However, for non-expert rapid approaches, we're looking at tools like Composite, which help build useful data warehouses on top of an OLTP system.
Rich
A: 

You may want to take a look at Data Vault Modeling. It claims to solve some longer-term issues, like changing attributes.

Stephan Eggermont
+1  A: 

I agree with the highly rated answer but thought I'd add the following:

* Do you use a staging area to perform the transformation and then load into the warehouse?

Whether staging is required depends on the type of transformation. Staging offers the benefit of breaking the ETL into more manageable chunks, and also provides a working area where manipulations can take place on the data without affecting the warehouse. It can help to have (at least) some dimension lookups in a staging area that store the keys from the OLTP system and the key of the latest dimension record, to use as a lookup when loading your fact records. The transformation happens in the ETL process itself, but it may or may not require some staging to help it along the way.
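
As a small sketch of that lookup idea (all names invented), the staging area might hold something like:

    -- Staging lookup mapping the OLTP business key to the latest surrogate key
    -- of the dimension, used when loading facts.
    CREATE TABLE Staging.CustomerKeyLookup
    (
        CustomerBK  NVARCHAR(20) NOT NULL PRIMARY KEY,  -- key from the OLTP system
        CustomerKey INT          NOT NULL               -- latest DimCustomer surrogate key
    );

    INSERT INTO dbo.FactSales (DateKey, CustomerKey, SalesAmount)
    SELECT s.DateKey,
           l.CustomerKey,                               -- resolved via the staging lookup
           s.SalesAmount
    FROM   Staging.SalesExtract      AS s
    JOIN   Staging.CustomerKeyLookup AS l ON l.CustomerBK = s.CustomerBK;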

* How do you link data between the warehouse and the OLTP database?

It is useful to load the business keys (or actual primary keys, if available) into the data warehouse as a reference back to the OLTP system. Also, auditing in the DW process should record the lineage of each bit of data by recording which load process loaded it.
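
A brief sketch of what that can look like (hypothetical tables and columns):

    -- Keep the OLTP business key on the dimension and stamp every row with the
    -- load run that produced it, for lineage.
    CREATE TABLE dbo.AuditLoad
    (
        LoadAuditID INT IDENTITY PRIMARY KEY,
        PackageName NVARCHAR(100) NOT NULL,
        LoadStarted DATETIME      NOT NULL DEFAULT GETDATE()
    );

    CREATE TABLE dbo.DimProduct
    (
        ProductKey  INT IDENTITY PRIMARY KEY,                            -- warehouse surrogate key
        ProductBK   NVARCHAR(20)  NOT NULL,                              -- business key, links back to the OLTP row
        ProductName NVARCHAR(100) NOT NULL,
        LoadAuditID INT NOT NULL REFERENCES dbo.AuditLoad (LoadAuditID)  -- which ETL run loaded this row
    );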

* Where/How do you manage the transformation process - in the database as sprocs, DTS/SSIS packages, or SQL from application code?

This would typically be in SSIS packages, but often it is more performant to transform in the source query. Unfortunately, that makes the source query quite complicated to understand and therefore to maintain, so if performance is not an issue then transforming in the SSIS code is best. Transforming in the source query is also another reason for having a staging area, since you can then join between the different staged tables within that query.
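
As a rough example (the tables and columns are invented), a transforming source query for a data flow might look like this:

    -- Pushing the transformation into the source query by joining the staged
    -- tables, instead of doing it in SSIS components.
    SELECT  o.OrderID,
            UPPER(LTRIM(RTRIM(c.CountryCode)))               AS CountryCode,  -- cleansing done in SQL
            CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112)) AS DateKey,      -- yyyymmdd surrogate date key
            o.Amount * x.ExchangeRate                        AS AmountUSD
    FROM    Staging.Orders        AS o
    JOIN    Staging.Customers     AS c ON c.CustomerID   = o.CustomerID
    JOIN    Staging.ExchangeRates AS x ON x.CurrencyCode = o.CurrencyCode
                                      AND x.RateDate     = CONVERT(DATE, o.OrderDate);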

Rich