views: 211

answers: 1

Currently, we plan to record a "batch id" for each batch of facts we load. That way, we can back out the load in case we find problems.
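
To make that concrete, here is roughly the shape of the fact side. This is only a simplified sketch, not our actual code; fact_sales and its columns are stand-ins, and sqlite3 is used just to keep the example self-contained.

    import sqlite3

    def load_fact_batch(conn, batch_id, rows):
        """Insert a batch of fact rows, tagging each row with the batch id."""
        conn.executemany(
            "INSERT INTO fact_sales (batch_id, customer_key, amount) VALUES (?, ?, ?)",
            [(batch_id, r["customer_key"], r["amount"]) for r in rows],
        )
        conn.commit()

    def back_out_fact_batch(conn, batch_id):
        """Back out a load by deleting every fact row tagged with that batch id."""
        conn.execute("DELETE FROM fact_sales WHERE batch_id = ?", (batch_id,))
        conn.commit()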

Should we consider tracking the batch id on the dimension rows, also?

It seems like dimension rows play by different rules. If we treat them as slowly changing, and use one of the SCD algorithms that preserve history, then reloading them doesn't really mean much: there is nothing batch-specific to back out.
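
For example, with a Type 2 (history-preserving) dimension, the conform step only touches the table when an attribute actually changed, so re-running it against the same source data is essentially a no-op. A rough sketch of that logic, with dim_customer and its columns standing in for our real schema:

    from datetime import date

    def conform_customer(conn, src):
        """Type 2 SCD: expire the current row and insert a new one only when something changed."""
        cur = conn.execute(
            "SELECT customer_key, name, city FROM dim_customer "
            "WHERE customer_id = ? AND end_date IS NULL",
            (src["customer_id"],),
        ).fetchone()

        # Attributes unchanged: re-running the conform step does nothing.
        if cur is not None and (cur[1], cur[2]) == (src["name"], src["city"]):
            return cur[0]

        today = date.today().isoformat()
        if cur is not None:
            # Expire the current version of this customer.
            conn.execute(
                "UPDATE dim_customer SET end_date = ? WHERE customer_key = ?",
                (today, cur[0]),
            )
        # Insert the new current version; a NULL end_date marks it as current.
        new = conn.execute(
            "INSERT INTO dim_customer (customer_id, name, city, start_date, end_date) "
            "VALUES (?, ?, ?, ?, NULL)",
            (src["customer_id"], src["name"], src["city"], today),
        )
        return new.lastrowid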

Typical Scenario. Conform dimension, handling SCD. Load facts. Done.

Extension. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts. Fix the problem. Reload facts. Done.

Possible Scenario. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts and the dimension rows. Fix the problem. Conform dimension, handling SCD. Load facts. Done.
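
If we did stamp dimension rows with the batch id, backing out the Possible Scenario might look like the sketch below. The columns insert_batch_id and expired_by_batch_id are hypothetical; the awkward part is that deleting the rows a batch inserted is not enough, because the rows that batch expired also have to be re-opened.

    def back_out_dimension_batch(conn, batch_id):
        """Hypothetical undo of a dimension load, assuming we stamped both the rows a
        batch inserted (insert_batch_id) and the rows it expired (expired_by_batch_id)."""
        # Drop the dimension rows this batch created.
        conn.execute(
            "DELETE FROM dim_customer WHERE insert_batch_id = ?", (batch_id,)
        )
        # Re-open the rows this batch expired, so they become current again.
        conn.execute(
            "UPDATE dim_customer SET end_date = NULL, expired_by_batch_id = NULL "
            "WHERE expired_by_batch_id = ?",
            (batch_id,),
        )
        conn.commit()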

It doesn't seem like tracking dimension changes helps very much at all. Any guidance on how best to handle an "undo" or "rollback" of a data warehouse load?

Our ETL tools are entirely home-grown Python applications.

+2  A: 

From my perspective, as long as you are not abusing your dimensions (like tracking time to the millisecond), there is not a lot of gain to be had by tracking batch ids on dimension rows for a rollback. You can also build a tool that cleans up unreferenced dimension rows once a month.
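
For what it's worth, that cleanup can be fairly small. The sketch below assumes a single fact table and made-up names; with several fact tables, every one of them has to appear in the check.

    def cleanup_unreferenced_dimensions(conn):
        """Delete dimension rows that no fact row references any more."""
        conn.execute(
            "DELETE FROM dim_customer "
            "WHERE customer_key NOT IN (SELECT customer_key FROM fact_sales)"
        )
        conn.commit()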

Jeffrey Hulten