I have an OLTP database and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, and notification settings.

In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).

There are currently 10,390 records but this figure is expected to grow.

I want to use Type 2 (slowly changing dimension) ETL, whereby if a record has changed in the OLTP database, a new record is added to the DW.

What is the best way to scan through the roughly 10,000 records in the DW and compare them against the data spread across several tables in the OLTP database?

I'm thinking of creating a "snapshot" of the OLTP data in a temporary table and then comparing it row by row with the data in the dimension table in the DW.
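Roughly, I mean something like this (the table and column names here are just for illustration):

    -- Flatten the normalised OLTP tables into one snapshot row per student.
    SELECT s.StudentID,
           a.AddressLine1, a.City, a.PostCode,
           c.Email, n.NotifyByEmail
    INTO   #StudentSnapshot
    FROM   Student s
    JOIN   Address a ON a.StudentID = s.StudentID
    JOIN   Contact c ON c.StudentID = s.StudentID
    JOIN   NotificationSettings n ON n.StudentID = s.StudentID;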

I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?

A: 

It sounds like you are approaching this somewhat backwards. The typical way of performing ETL (Extract, Transform, Load) is:

  1. "Extract" data from your OLTP database
  2. Compare ("Test") your extracted data against the dimensional data to determine if there are changes or whatever other validation needs to be performed
  3. Insert the data ("Load") in to your dimension table.

Effectively, in step #1 you'll create a physical record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine whether a modification was made. This is the standard way of doing things. In addition, 10,000 rows is pretty insignificant as far as volume goes; any RDBMS and ETL process should get through that in no more than a few seconds. SQL Server has DTS (replaced by SSIS as of SQL Server 2005), which is the perfect tool for something like this.
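For example, something along these lines in plain T-SQL (a minimal sketch only: the table names, join keys, and the IsCurrent/EffectiveFrom housekeeping columns are all invented for illustration, and EXCEPT stands in for the comparison since SQL Server 2005 has no MERGE):

    -- 1. "Extract": flatten the normalised OLTP tables into a staging table.
    SELECT s.StudentID, a.AddressLine1, c.Email, n.NotifyByEmail
    INTO   #Staging
    FROM   Student s
    JOIN   Address a ON a.StudentID = s.StudentID
    JOIN   Contact c ON c.StudentID = s.StudentID
    JOIN   NotificationSettings n ON n.StudentID = s.StudentID;

    -- 2. Compare: rows that are new or differ from the current dimension row.
    --    EXCEPT compares all columns at once and treats NULLs as equal.
    SELECT StudentID, AddressLine1, Email, NotifyByEmail
    INTO   #Changed
    FROM   (SELECT StudentID, AddressLine1, Email, NotifyByEmail FROM #Staging
            EXCEPT
            SELECT StudentID, AddressLine1, Email, NotifyByEmail
            FROM   DimStudents
            WHERE  IsCurrent = 1) AS diff;

    -- 3. "Load" (Type 2): expire the old version, then insert the new one.
    UPDATE dim
    SET    IsCurrent = 0, EffectiveTo = GETDATE()
    FROM   DimStudents dim
    JOIN   #Changed chg ON chg.StudentID = dim.StudentID
    WHERE  dim.IsCurrent = 1;

    INSERT INTO DimStudents
           (StudentID, AddressLine1, Email, NotifyByEmail,
            EffectiveFrom, EffectiveTo, IsCurrent)
    SELECT StudentID, AddressLine1, Email, NotifyByEmail,
           GETDATE(), NULL, 1
    FROM   #Changed;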

GregH
OK, what I'm doing is taking a snapshot of my OLTP data (extract) and then comparing that with my data in the OLAP database. My question really is: what is the best way to compare the data? Checking for new values is simple, but checking for updated values is more difficult since I don't have an updated-date flag on any OLTP tables. What is the best way to go through a record set, checking whether a value in one table is different from the value in another table? Unfortunately, the infrastructure doesn't allow me to use SSIS, so I'll need to do it through a stored proc.
Paul
A: 

Does your OLTP database have an audit trail?

If so, then you can query the audit trail for just the records that have been touched since the last ETL.
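For example (purely illustrative; AuditLog and EtlControl are hypothetical names for the audit table and for an ETL control table that records the last run):

    -- Pull only the student keys touched since the last ETL run.
    DECLARE @LastRun datetime;

    SELECT @LastRun = LastRunTime
    FROM   EtlControl
    WHERE  JobName = 'DimStudents';

    SELECT DISTINCT StudentID
    FROM   AuditLog
    WHERE  ChangedAt >= @LastRun;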

wshato
Unfortunately it doesn't. Unless there is an easy way to check for changed records, I'm thinking it might be easier to create an audit trail, although this will mean pretty major changes to the OLTP database.
Paul
+1  A: 
  • Introduce a LastUpdated column into the source system (OLTP) tables. This way you have less to extract, using:

    WHERE LastUpdated >= some_time_here

You seem to be using SQL Server, so you may also try the rowversion type (an 8-byte, database-scope-unique counter); see the sketch below the list.

  • When importing your data into the DW, use an ETL tool (SSIS, Pentaho, Talend). They all have a component (block, transformation) to handle SCD2 (slowly changing dimension type 2). For an SSIS example see here. The transformation does exactly what you are trying to do: all you have to do is specify which columns to monitor and what to do when it detects a change.
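A rough sketch of the rowversion approach (RowVer, EtlControl, and the other names here are hypothetical; the last-seen value would be persisted between ETL runs):

    -- rowversion (a.k.a. timestamp) increments on every insert/update, so rows
    -- changed since the last run are those above the last-seen value.
    DECLARE @LastVersion binary(8);

    SELECT @LastVersion = LastRowVersion
    FROM   EtlControl
    WHERE  JobName = 'DimStudents';

    SELECT StudentID, AddressLine1, Email
    FROM   Student
    WHERE  RowVer > @LastVersion;       -- RowVer is a rowversion column

    -- Afterwards, record the new high-water mark for the next run.
    UPDATE EtlControl
    SET    LastRowVersion = (SELECT MAX(RowVer) FROM Student)
    WHERE  JobName = 'DimStudents';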
Damir Sudarevic
Thanks, VERY helpful.
Paul