views:

306

answers:

3

We've got data with two different origins: some comes from a customer, some comes from different vendors. Currently, we physically "merge" this data into a massive table with almost a hundred columns, tens of thousands of rows and no formal separation of the two dimensions. Consequently, we can't actually use this table for much.

I'm going to redesign this mess into a proper, but small, star schema.

The two dimensions are obvious. One of them, for example, is time.

The customer-supplied data provides a number of fact values. Each vendor may (or may not) provide additional fact values that fit the same dimensions.

This fact data all has the same granularity. It can be called "sparse" because we don't often get information from all vendors.

Here's my dilemma.

Is this one fact table -- with some nulls -- populated from different sources?

Or is this n+1 fact tables -- one populated from the customer, the others populated from each vendor?

There are pros and cons to each design. I need some second opinions on the choice between "merge" or "load separately".


Customer supplies revenue, cost, counts, weights, and other things they know about their end of a transaction.

Vendor one supplies some additional details about some of the transactions -- weights, costs, durations. The other transactions will have no value from vendor one.

Vendor two supplies some additional details about some of the transactions -- volumes, durations, lengths, foreign currency rates. The other transactions will have no value for vendor two.

Some transactions will have both vendors. A few transactions will have neither vendor.

One table with nulls? Three tables?

+2  A: 

I'd go for the single fact table. The highlight pro of this approach is that it leaves all the hard work at load time rather than at query time.

Bell
A: 

From what you describe, it sounds like a single fact table is the way to go.

It sounds like the fact table would have a grain of time x transaction x customer(?).

My prior question was really trying to find out if some of the vendor data was a candidate for its own dimension. I'll leave it to you to determine that. but it doesn't really sound like it.

Null facts can throw warnings during aggregations (depending on the platform) but the alternative of populating them with possibly misleading zeros is worse.

TrickyNixon
A: 

I believe that since both sources share the same grain the answer is that you should have one fact table. Think about how you want your end-users to interact with the information. If it makes sense and the business reports will benefit from those data being co-located then that is your answer. Try though to avoid nulls in your fact tables. If you can enter a zero (and the zero makes sense for the data, i.e., think temperature) then do that. It will save your users some confusion and as TrickyNixon pointed out will cause aggregation issues.

Actually you're at a great point here on the 'brownfield' application. You can look at what exists today and leverage experience to create a better design. This is the most important time to select the best grain that will hopefully not change for the life of the DW.

esabine