views:

455

answers:

4

I am building a poor man's data warehouse using a RDBMS. I have identified the key 'attributes' to be recorded as:

  • sex (true/false)
  • demographic classification (A, B, C etc)
  • place of birth
  • date of birth
  • weight (recorded daily): The fact that is being recorded

My requirements are to be able to run 'OLAP' queries that allow me to:

  • 'slice and dice'
  • 'drill up/down' the data and
  • generally, be able to view the data from different perspectives

After reading up on this topic area, the general consensus seems to be that this is best implemented using dimension tables rather than normalized tables.

Assuming that this assertion is true (i.e. the solution is best implemented using fact and dimension tables), I would like to seek some help in the design of these tables.

'Natural' (or obvious) dimensions are:

  • Date dimension
  • Geographical location

Which have hierarchical attributes. However, I am struggling with how to model the following fields:

  • sex (true/false)
  • demographic classification (A, B, C etc)

The reason I am struggling with these fields is that:

  1. They have no obvious hierarchical attributes which will aid aggregation (AFAIA) - which suggest they should be in a fact table
  2. They are mostly static or very rarely change - which suggests they should be in a dimension table.

Maybe the heuristic I am using above is too crude?

I will give some examples on the type of analysis I would like to carryout on the data warehouse - hopefully that will clarify things further.

I would like to aggregate and analyze the data by sex and demographic classification - e.g. answer questions like:

  • How does male and female weights compare across different demographic classifications?
  • Which demographic classification (male AND female), show the most increase in weight this quarter.

etc.

Can anyone clarify whether sex and demographic classification are part of the fact table, or whether they are (as I suspect) dimension tables.?

Also assuming they are dimension tables, could someone elaborate on the table structures (i.e. the fields)?

The 'obvious' schema:

CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));

may not be the correct one.

+1  A: 

Star schema searches are the SQL equivalent of the intersection points of Venn Diagrams. As your sample queries clearly show, SEX_TYPE and DEMOGRAPHIC_CATEGORY are sets you want to search by and hence must be dimensions.

As for the table structures, I think your design for SEX_TYPE is misguided. For starters it is easier, more intuitive, to design queries on the basis of

where sex_type.name = 'FEMALE'

than

where sex_type.is_male = 1

Besides, in the real world sex is not a boolean. Most applications should gather UNKNOWN and TRANSGENDER as well, and that's certainly true for health/medical apps which is what you seem to be doing. Furthermore, it will avoid some unpleasant office arguments if you have any female co-workers.

Edit

"I am thinking of how to deal with cases of new sex_types and demographic categories not already in the database"

There was a vogue for not having foreign keys in Data Warehouses. But they provide useful metadata which a query optimizer can use to derive the most efficient search path. This is particularly important when there is a lot of data and ad hoc queries to process. Dealing with new dimension values is always going to be hard, unless your source systems provide you with notification. This really depends on your set-up.

APC
@APC: thanks for the feedback. Now I now that SEX_TYPE and DEMOGRAPHIC_CATEGORY are dimensions. This is new area to me, so I may have to ask a few more questions that may seem banal/asinine to you.. Please bear with me. From the above, my understanding is that I need to have FKs in the fact table that are PKs in the SEX_TYPE and DEMOGRAPHIC_CATEGORY. Could you confirm this? (I am thinking of how to deal with cases of new sex_types and demographic categories not already in the database).
morpheous
A: 

What OLAP / presentation tier tool are you intending to use? These often have features of their own to support building of cubes, hierarchies, aggregations, etc.

Normal Form is usually the most sound basis for a flexible and efficient Data Warehouse, although Marts are sometimes denormalized to support a specific set of reporting requirements. In the absence of any other information I suggest you aim to ensure your database is in at least Boyce-Codd / 5th Normal Form.

dportas
+2  A: 

Not sure why you feel that using RDBMS is poor man's solution, but hope this may help.

alt text

Tables dimGeography and dimDemographic are so-called mini-dimensions; they allow for slicing based on demographic and geography without having to join dimUser, and also to capture user's current demographic and geography at the time of measurement.

And by the way, when in DW world, verbose -- Gender = 'female', AgeGroup = '30-35', EducationLevel = 'university', etc.

Damir Sudarevic
+2  A: 

Generally, all numeric quantities and measures are columns in the fact table(s). Then everything else is a dimensional attribute. Which dimension they belong in is rather pragmatic and depends on the data.

In addition to the suggestions you have already received, I saw no mention of degenerate dimensions. In these cases, things like an invoice number or sequence number timestamp which is different for every fact needs to be stored in the fact, otherwise the dimension table will become 1-1 with the fact table.

A key design decision in your case is probably the analysis of data related to age if the study is ongoing. Because people's ages change with time, they will move to another age group at some point. Depending on whether the groups are fixed at the beginning of a study or not, this may determine how you want to aggregate. I'm not necessarily saying you should have a group dimension and get to age through that, but that you may need to determine the correct age/demographic dimension during the ETL. But this depends on the end use (or accommodate both with two dimension roles linked from the fact table - initial demographics, which never changes, and current demographics which will change with time).

A similar thing could apply with geography. Although you can obviously track a person's geography by analysing current geography changes over time, the point of a dimensional DW is to have all the relevant dimensions linked straight to the fact (things which you might normally derive in a normalized model through the network of an Entity-Relationship model - these get locked in at the time of ETL). This redundancy makes analysis quicker on the dimensional model in traditional RDBMSes.

Note that a lot of this does not apply in massively parallel DW like Teradata, which don't perform well with star schemas - they like all the data normalized and linked up to the same primary index because they the primary index to distribute the data over the processing units.

Cade Roux