I'm working on a data warehouse and trying to figure out how best to verify that data from our data cleansing (normalized) database makes it into our data marts correctly. I've done some searching, but the results so far talk more about making sure constraints are in place and doing data validation during the ETL process (e.g., that dates are valid). The dimensions were pretty easy, as I could either leverage the primary key or write a very simple, verifiable query to get the data. The fact tables are more complex.

Any thoughts? We're trying to make this easy enough that a subject matter expert can run a couple of queries, see some data from both the data cleansing database and the data marts, and visually compare the two to ensure they are correct.

+1  A: 

You test your fact table loads by implementing a simplified, pared-down subset of the same data manipulation elsewhere, and comparing the results.

You calculate the same totals, counts, or other figures at least twice. One calculation comes from the fact table itself, after it has finished loading; the other comes from some other source, such as:

  • the source data directly, controlling for all the scrubbing steps in between source and fact
  • a source system report that is known to be correct
  • etc.

If you are doing this in the database, you can write each test as a query that returns no records if everything is correct. Any records that do come back are exceptions: the count of x by (y, z) does not match.
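As an illustration, here is a minimal sketch of that kind of exception check in Python; it is not code from this answer. The database file, the stage_orders and fact_sales tables, and the order_date and amount columns are all hypothetical placeholders for the objects in your own cleansing database and data mart.

```python
# Minimal sketch of an exception query: no rows returned means the totals agree.
# All table, column, and file names here are hypothetical placeholders.
import sqlite3

EXCEPTION_QUERY = """
SELECT s.order_date,
       s.total AS staged_total,
       f.total AS fact_total
FROM (SELECT order_date, SUM(amount) AS total
        FROM stage_orders GROUP BY order_date) AS s
LEFT JOIN (SELECT order_date, SUM(amount) AS total
             FROM fact_sales GROUP BY order_date) AS f
       ON s.order_date = f.order_date
WHERE f.total IS NULL OR s.total <> f.total
"""
# Note: a FULL OUTER JOIN (or a second query in the other direction) would also
# catch dates that exist only in the fact table.

def check_daily_totals(conn: sqlite3.Connection) -> list[tuple]:
    """Return the exception rows; an empty list means the two totals match."""
    return conn.execute(EXCEPTION_QUERY).fetchall()

if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:   # hypothetical database
        exceptions = check_daily_totals(conn)
        for row in exceptions:
            print("total mismatch:", row)
        print("OK" if not exceptions else f"{len(exceptions)} exception(s)")
```

The same shape works for counts, distinct counts, or any other figure a subject matter expert would recognize from a source system report.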

See this excellent post by ConcernedOfTunbridgeWells for more recommendations.

Peter
Thanks for the info. Because I couldn't find anything yesterday, I started doing something similar: looking at chunks of data in the data mart and comparing those chunks to the specific records they came from in our cleansing DB. However, I do like the idea of doing the calculations twice. We just don't want our validation to look like our ETL process.
blockcipher
A: 

Although it has some drawbacks and potential problems if you do a lot of cleansing or transforming, I've found you can round-trip an input file by re-generating it from the star schema(s) and then simply comparing the original input file to the regenerated one. It might require some massaging to make them match (e.g., one is left-padded, the other right-padded).

Typically, I had a program that used the same layout the ETL used and did a compare, ignoring alignment within a field. The files might also have to be sorted first; I used a command-line sort for that.
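Here is a rough sketch of such a compare in Python rather than the original program. The fixed-width layout (FIELD_WIDTHS) and the file names are hypothetical; in practice the widths should come from the same layout definition the ETL uses.

```python
# Rough sketch of a fixed-width file compare that ignores padding/alignment.
# FIELD_WIDTHS and the file names are hypothetical placeholders.
FIELD_WIDTHS = [10, 8, 12]   # e.g. customer id, date, amount

def fields(line: str, widths=FIELD_WIDTHS) -> list[str]:
    """Split a fixed-width record and strip padding so alignment is ignored."""
    out, pos = [], 0
    for width in widths:
        out.append(line[pos:pos + width].strip())
        pos += width
    return out

def compare(original_path: str, regenerated_path: str) -> list[str]:
    """Return human-readable differences; an empty list means a clean round trip."""
    with open(original_path) as a, open(regenerated_path) as b:
        left = sorted(fields(line) for line in a if line.strip())
        right = sorted(fields(line) for line in b if line.strip())
    diffs = [f"record {i}: {l} != {r}"
             for i, (l, r) in enumerate(zip(left, right)) if l != r]
    if len(left) != len(right):
        diffs.append(f"record count differs: {len(left)} vs {len(right)}")
    return diffs

if __name__ == "__main__":
    for problem in compare("input.dat", "regenerated.dat"):
        print(problem)
```

Sorting both sides stands in for the command-line sort, and stripping each field stands in for the left/right padding massaging.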

If your ETL transforms something incorrectly and the regeneration reverses that transform incorrectly in the same way, this method can still miss problems in the DW, so I wouldn't claim it gives complete coverage, but it's a pretty good first whack at a regression unit test for each load.

Cade Roux