views:

281

answers:

3

I have a dimension (SiteItem) has two important facts:

perUserClicks 
perBrowserClicks

however, within this dimension, I have groups of values based on an attribute column (let's call the groups AboveFoldItems, LeftNavItems, OnTheFlyItems, etc.) each have more facts that are specific to that group:

AboveFoldItems: eyeTime, loadTime
LeftNavItems: mouseOverTime
OnTheFlyItems: doesn't have any extra, but may in the future

Is the following fact table schema ok?

DateKey   
SessionKey
SiteItemKey
perUserClicks 
perBrowserClicks
eyeTime
loadTime
mouseOverTime

It seems a little wasteful since only some columns pertain to some dimension keys (the irrelevant facts are left NULL). But... this seems like it would be a common problem, so there should be a common solution for this, right?

+2  A: 

There isn't an elegant solution really, you either have nullable columns or you use an EAV solution. I posted about EAV before (and generated a lot of comments that might be worthwhile reading):

http://sqlblog.com/blogs/aaron_bertrand/archive/2009/11/19/what-is-so-bad-about-eav-anyway.aspx

I am a fan of that model in some scenarios, but if your dimensions/attributes do not change frequently, it can be a lot of extra work for nothing. NULL values in a column do not really make waste as long as the surrounding code can deal with them appropriately.

Aaron Bertrand
Thanks for the link, and comparison to EAV - I hadn't thought of it that way.
Jeff Meatball Yang
+1  A: 

You could have more than one fact table: factperUserClicks, factperBroWserClicks, factEyeTime, etc...

Each of these would have DateKey, SessionKey, SiteItemKey. This way only dimension keys that "make sense" appear with each fact.

Ideally, there should be no NULLS in the DW -- if you keep them in the same fact table, using zeros may be more appropriate.

As far as saving disk space, I do not see an ideal solution -- but, in a DW one is supposed to trade space for speed and (query) simplicity anyway.

Damir Sudarevic
The problem is that I need to query the SiteItem dimensions together, and retrieve the aggregates across a user-defined list of facts. It seems I could join two fact tables together, but would need to do a LEFT JOIN to correctly aggregate.
Jeff Meatball Yang
+2  A: 

I'm generally in agreement with Damir's answer on this, but because the fact table is very narrow in your particular case, there is still merit to Aaron's advocation for keeping the NULLs.

We have several star schemas in particular subject areas with multiple fact tables that share most (if not all) of the dimensions (conformed and internal). The limited-scope dimensions are not considered "conformed" across the enterprise, but they are what we would call "shared internal" dimensions.

Now typically, if the data is loaded contemporaneously so that the dimension hasn't changed, you can join both fact tables on the keys, but in general, of course, you cannot join two different star schemas on the dimension keys if they are surrogates in traditional slowly changing dimensions. In general, you have to join separate stars on the natural keys or "business keys" within the dimension and not on surrogates (except usually in the special case of the date dimension where it is unchanging and only has a natural key).

Note that when you do join the two stars, you have to use a LEFT JOIN, in which case you WILL produce NULLs which you will still probably have to take account of - so you're actually getting back to the original model you had with NULLs! ;-)

The benefit of the extra fact table is more obvious when your tables are wide with a smaller set of keys and the vertical partitioning of the data produces space savings as well as a cleaner logical model - this is especially true when the keys are only really shared up to a point - having one dummy key or NULL key is definitely not a good idea - this usually points to a dimensional modeling problem.

However, as Aaron says, if you push it to extremes, you can have a single fact column in each fact table with shared keys, which means the key overhead dwarfs the fact cost and you really do end up in a disguised EAV model.

I would also look to see if you are in Kimball's situation of "too few dimensions". Seems like you must have good dimensional attributes lumped into the SessionKey and SiteItemKey - but without seeing your entire model and requirements, it's hard to say, but I would think you would have some user demographics in a low-cardinality or even snowflake dimension without the full Session or Site dimension.

Cade Roux
Thanks for the discussion! I think I do have a situation of shared internal dimensions. Your comparison of joining two fact tables sheds light on why we keep NULLs instead of zeros (zeros would affect the average here, and we have selects with weird cases for NULLs. I can't divulge much else about our schema, but you're correct that some users could benefit from additional, more-specific dimensions.
Jeff Meatball Yang