views: 211

answers: 1

Currently, we plan to record a "batch id" for each batch of facts we load. That way, we can back out the load in case we find problems.
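
To make that concrete, here is roughly the shape of the fact side. This is only a simplified sketch, not our actual code; fact_sales and its columns are stand-ins, and sqlite3 is used just to keep the example self-contained.

    import sqlite3

    def load_fact_batch(conn, batch_id, rows):
        """Insert a batch of fact rows, tagging each row with the batch id."""
        conn.executemany(
            "INSERT INTO fact_sales (batch_id, customer_key, amount) VALUES (?, ?, ?)",
            [(batch_id, r["customer_key"], r["amount"]) for r in rows],
        )
        conn.commit()

    def back_out_fact_batch(conn, batch_id):
        """Back out a load by deleting every fact row tagged with that batch id."""
        conn.execute("DELETE FROM fact_sales WHERE batch_id = ?", (batch_id,))
        conn.commit()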

Should we consider tracking the batch id on the dimension rows, also?

It seems like dimension rows play by different rules. If we treat them as slowly changing, and use one of the SCD algorithms that preserve history, then reloading them doesn't really mean much: there is nothing batch-specific to back out.
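
For example, with a Type 2 (history-preserving) dimension, the conform step only touches the table when an attribute actually changed, so re-running it against the same source data is essentially a no-op. A rough sketch of that logic, with dim_customer and its columns standing in for our real schema:

    from datetime import date

    def conform_customer(conn, src):
        """Type 2 SCD: expire the current row and insert a new one only when something changed."""
        cur = conn.execute(
            "SELECT customer_key, name, city FROM dim_customer "
            "WHERE customer_id = ? AND end_date IS NULL",
            (src["customer_id"],),
        ).fetchone()

        # Attributes unchanged: re-running the conform step does nothing.
        if cur is not None and (cur[1], cur[2]) == (src["name"], src["city"]):
            return cur[0]

        today = date.today().isoformat()
        if cur is not None:
            # Expire the current version of this customer.
            conn.execute(
                "UPDATE dim_customer SET end_date = ? WHERE customer_key = ?",
                (today, cur[0]),
            )
        # Insert the new current version; a NULL end_date marks it as current.
        new = conn.execute(
            "INSERT INTO dim_customer (customer_id, name, city, start_date, end_date) "
            "VALUES (?, ?, ?, ?, NULL)",
            (src["customer_id"], src["name"], src["city"], today),
        )
        return new.lastrowid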

Typical Scenario. Conform dimension, handling SCD. Load facts. Done.

Extension. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts. Fix the problem. Reload facts. Done.

Possible Scenario. Conform dimension, handling SCD. Load facts. Find a problem. Delete the batch of facts and the dimension rows. Fix the problem. Conform dimension, handling SCD. Load facts. Done.
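
If we did stamp dimension rows with the batch id, backing out the Possible Scenario might look like the sketch below. The columns insert_batch_id and expired_by_batch_id are hypothetical; the awkward part is that deleting the rows a batch inserted is not enough, because the rows that batch expired also have to be re-opened.

    def back_out_dimension_batch(conn, batch_id):
        """Hypothetical undo of a dimension load, assuming we stamped both the rows a
        batch inserted (insert_batch_id) and the rows it expired (expired_by_batch_id)."""
        # Drop the dimension rows this batch created.
        conn.execute(
            "DELETE FROM dim_customer WHERE insert_batch_id = ?", (batch_id,)
        )
        # Re-open the rows this batch expired, so they become current again.
        conn.execute(
            "UPDATE dim_customer SET end_date = NULL, expired_by_batch_id = NULL "
            "WHERE expired_by_batch_id = ?",
            (batch_id,),
        )
        conn.commit()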

It doesn't seem like tracking dimension changes helps very much at all. Any guidance on how best to handle an "undo" or "rollback" of a data warehouse load?

Our ETL tools are entirely home-grown Python applications.

+2  A: 

From my perspective, as long as you are not abusing your dimensions (like tracking time to the millisecond), there is not a lot of gain to be had by tracking batch ids on dimension rows for a rollback. You can also build a tool that cleans up unreferenced dimension rows once a month.
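
For what it's worth, that cleanup can be fairly small. The sketch below assumes a single fact table and made-up names; with several fact tables, every one of them has to appear in the check.

    def cleanup_unreferenced_dimensions(conn):
        """Delete dimension rows that no fact row references any more."""
        conn.execute(
            "DELETE FROM dim_customer "
            "WHERE customer_key NOT IN (SELECT customer_key FROM fact_sales)"
        )
        conn.commit()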

Jeffrey Hulten