Hi.

We are working on a data warehouse for a bank and have pretty much followed the standard Kimball model: staging tables, a star schema and an ETL process to pull the data through.

Kimball talks about using the staging area for import, cleaning, processing and everything else until you are ready to put the data into the star schema. In practice this typically means uploading data from the sources into a set of tables with little or no modification, then optionally taking the data through intermediate tables until it is ready to go into the star schema. That's a lot of work for a single area; there's no single responsibility here.

Previous systems I have worked on have made a distinction between the different sets of tables, to the extent of having Upload (raw source system data, unmodified) tables, Staging (intermediate processing, typed and cleansed) tables and Warehouse tables. You can stick these in separate schemas and then apply differing policies for archive/backup/security etc. One of the other guys has worked on a warehouse where there is a StagingInput and a StagingOutput, similar story. The team as a whole has a lot of experience, both datawarehouse and otherwise.
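
To make that layering concrete, here is a rough sketch (Python used purely for illustration; the layer names, policies and helper are just examples of the kind of thing we have in mind, not an actual implementation):

    # Sketch only: the three-layer split described above, with per-layer
    # policies kept alongside the schema names so they could drive
    # backup/security jobs. All names are hypothetical.

    LAYERS = {
        "Upload": {          # raw source-system data, loaded unmodified
            "retention": "current load only (truncate before each load)",
            "backup": "none - can always be re-extracted from the source",
            "access": "ETL service account only",
        },
        "Staging": {         # typed, cleansed, intermediate processing
            "retention": "until the load that used it has been verified",
            "backup": "none / minimal",
            "access": "ETL service account and warehouse developers",
        },
        "Warehouse": {       # the star schema: dimensions and facts
            "retention": "full history",
            "backup": "full, on the normal production schedule",
            "access": "BI tools and report writers, read-only",
        },
    }

    def qualified_name(layer: str, table: str) -> str:
        """Build a schema-qualified table name, e.g. 'Upload.CustomerFeed'."""
        if layer not in LAYERS:
            raise ValueError(f"unknown layer: {layer}")
        return f"{layer}.{table}"

So qualified_name("Upload", "CustomerFeed") gives "Upload.CustomerFeed", and each schema can be backed up and secured on its own terms.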

However, despite all this, looking through Kimball and the web, there seems to be absolutely nothing in writing about giving any kind of structure to the staging database. One would be forgiven for believing that Mr Kimball would have us all work with staging as one big, deep, dark, unstructured pool of data.

Whilst it is of course fairly obvious how we would go about adding more structure to the staging area, it seems very odd that nothing appears to have been written about it.

So, what is everyone else out there doing? Is staging just this big unstructured mess or do folk have some interesting designs on it?

A: 

Personally, I don't go looking for trouble, in Kimball or elsewhere.

What kind of "structure" are you looking for? What kind of "structure" do you feel is needed? What problems are you seeing from the lack of "structure" you have today?

I may be leaving you with the impression that I don't think much of Kimball. Not so - I haven't read Kimball. I just don't think much of changing things for no reason beyond fitting some pattern. Change to solve some real-world problem would be fine. For instance, if you find you're backing up staging tables because a lack of structure caused the staging and warehouse tables to be treated the same, then this would be a reason to change the structure. But if that's the sort of thing you had in mind, then you should edit your question to indicate it.

John Saunders
The driver for us looking at this right now is that we need to be able to separate the "upload" process from the "staging" process, as feeds become available at different times. We have a requirement to upload feeds as they become available and then run the rest of the ETL. At the moment the whole staging process is mixed up in one big series of tasks. Quite aside from that, we have a requirement to write structured software to satisfy our audit requirements.
Chris Needham
@Chris: You should clarify your question, then. I read it to be about the tables within the database, and not about structuring the process. That's a whole different question.
John Saunders
I don't think we can completely separate the structure of the ETL from that of the tables. Yes, my question was primarily about the table structure (it goes against the grain to have a large number of tables with no RI, constraints or anything), but the ETL structure follows on from how the tables are arranged.
Chris Needham
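
To illustrate the decoupling discussed in the comments above, a minimal sketch (Python, with hypothetical feed names, and an in-memory set standing in for what would really be a control table): each feed is uploaded independently as soon as it arrives, and the downstream staging/warehouse run only starts once every expected feed has landed.

    # Sketch only: decouple "upload a feed when it arrives" from "run the
    # rest of the ETL once everything has landed". Names are hypothetical.

    EXPECTED_FEEDS = {"customers", "accounts", "transactions"}
    _landed: set[str] = set()   # in practice a control table, not memory

    def upload_feed(feed_name: str) -> None:
        """Load one raw feed into its Upload table, unmodified, on arrival."""
        # ... bulk-load the file into the matching Upload table here ...
        _landed.add(feed_name)
        _maybe_run_downstream()

    def _maybe_run_downstream() -> None:
        """Start the downstream run only when every expected feed is in."""
        if _landed >= EXPECTED_FEEDS:
            run_staging_and_warehouse()

    def run_staging_and_warehouse() -> None:
        """The rest of the ETL: clean/type into Staging, then load the star schema."""
        # ... transformations and dimension/fact loads here ...
        _landed.clear()          # reset for the next load cycle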
+3  A: 

I have experienced the same problem. We have a large HR data warehouse and I am pulling data from systems all over the enterprise. I've got a nice collection of fact and dimension tables, but the staging area is a mess. I don't know of any standards for the design of this area. I would follow the same path you are on and come up with a standard set of names to keep things in order. Your suggestion is pretty good for the naming; I'd keep working with that.
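
One possible shape for such a naming standard, sketched in Python purely to show the idea (the prefixes themselves are hypothetical; applying one convention consistently matters more than which convention it is):

    # Sketch only: one possible prefix convention for the staging area.
    PREFIXES = {
        "upl_": "raw upload, exactly as received from the source system",
        "stg_": "typed/cleansed intermediate table",
        "xref_": "cross-reference / key-mapping table",
        "dim_": "warehouse dimension",
        "fact_": "warehouse fact",
    }

    def classify(table_name: str) -> str:
        """Return the role of a table based on its prefix, or 'unknown'."""
        for prefix, role in PREFIXES.items():
            if table_name.startswith(prefix):
                return role
        return "unknown"

For example, classify("stg_customer") returns "typed/cleansed intermediate table".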

ChrisLoris
Curious, an area that nobody seems to be interested in, yet one which affects every BI project of any scale. I guess that the Upload and Staging distinction will give us some structure at least.
Chris Needham
+1  A: 

There can be sub-areas within Staging, called Staging1 and Staging2, for example.

Staging1 can be a direct pull from the data sources with no transformation, and it only keeps the latest data.

Staging2 holds the data once it has been transformed and is ready to go to the warehouse, and it keeps all historical data.
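
A minimal sketch of that two-step flow (Python with a generic DB-API connection and qmark-style placeholders; table and column names are hypothetical): Staging1 is emptied and reloaded with the latest pull, while Staging2 only ever receives inserts, so it accumulates history.

    # Sketch only: Staging1 holds just the latest raw pull, Staging2
    # accumulates the transformed history. All names are hypothetical.

    def refresh_staging1(conn, source_rows) -> None:
        """Replace Staging1 with the latest pull - no transformation, no history."""
        cur = conn.cursor()
        cur.execute("DELETE FROM staging1_customer")      # latest data only
        cur.executemany(
            "INSERT INTO staging1_customer (customer_id, customer_name) VALUES (?, ?)",
            source_rows,
        )
        conn.commit()

    def load_staging2(conn, batch_id: int) -> None:
        """Append the transformed rows to Staging2, which keeps all history."""
        cur = conn.cursor()
        cur.execute(
            """
            INSERT INTO staging2_customer (batch_id, customer_id, customer_name)
            SELECT ?, CAST(customer_id AS INTEGER), TRIM(customer_name)
            FROM staging1_customer
            """,
            (batch_id,),
        )
        conn.commit()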

Ken Yao
Thanks Ken, yes, this is similar to designs I have worked with in the past. What I find strange is that there is nothing published about it.
Chris Needham
+2  A: 

Just a note: there is a book called "The Data Warehouse ETL Toolkit" by Ralph Kimball and Joe Caserta, so Mr. Kimball did put some effort into this. :)

Damir Sudarevic
A: 

Have a look at this post here. It gives a good overview of the responsibilities of a staging area within a DW.

Patrick Peters