ansaurus

Question

Answer 1

+2 A:

What I'd like to do is come up with a way that will allow us to leverage the data we have - how many tags in Florida for the month of June? How many facilities were on time delivering their documentation? How many annual (surprise) surveys happened in the 1st quarter of this year compared to last year?

A dimension is a measurement range: an axis along which you group or filter your facts. The range can be continuous, like dates, or discrete, like facilities. In your three questions, the dimensions are facility and date; date/time; and date, respectively.

The only way you can answer the question "How many tags in Florida for the month of June?" is to associate tags with facilities and tags with dates.

The only way you can answer the question "How many facilities were on time delivering their documentation?" is to associate documentation delivery with facility and date due with facility.

You should follow this same analytical process with the rest of the questions or queries you expect the data warehouse to answer.

A fact is an entity or an object. A tag is a fact. Documentation delivery is a fact. Facts are almost always immutable in a data warehouse once they're loaded.

As to your schema, I'd have to study it more to give specific recommendations, but in general, you want to use a star schema. The center of each star is a fact table holding your facts (your entities and objects). The tables that make up the points of the star are your dimension tables.

The first thing you need to do is separate your facts and your dimensions. None of your entity tables should contain dates, location codes, or whatever else you determine is a dimension. However, your fact tables will contain foreign keys to date tables, location tables, or other dimension tables.
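As a rough illustration of that separation, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_facility, dim_date, fact_tag) and the sample rows are assumptions made up for the example, not your actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: the points of the star.
    CREATE TABLE dim_facility (
        facility_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        state       TEXT NOT NULL
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY,   -- e.g. 20100605
        year    INTEGER NOT NULL,
        month   INTEGER NOT NULL,
        day     INTEGER NOT NULL
    );
    -- Fact table: the center of the star, with foreign keys to the dimensions.
    CREATE TABLE fact_tag (
        tag_id      INTEGER PRIMARY KEY,
        facility_id INTEGER NOT NULL REFERENCES dim_facility(facility_id),
        date_id     INTEGER NOT NULL REFERENCES dim_date(date_id)
    );
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO dim_facility VALUES (?, ?, ?)",
                 [(1, "Facility A", "FL"), (2, "Facility B", "FL"), (3, "Facility C", "GA")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)",
                 [(20100605, 2010, 6, 5), (20100620, 2010, 6, 20), (20100702, 2010, 7, 2)])
conn.executemany("INSERT INTO fact_tag VALUES (?, ?, ?)",
                 [(1, 1, 20100605), (2, 2, 20100620), (3, 3, 20100605), (4, 1, 20100702)])

def tags_in_state_for_month(state, year, month):
    """Count tag facts by joining the fact table to both dimensions."""
    row = conn.execute("""
        SELECT COUNT(*)
        FROM fact_tag f
        JOIN dim_facility fa ON fa.facility_id = f.facility_id
        JOIN dim_date d      ON d.date_id = f.date_id
        WHERE fa.state = ? AND d.year = ? AND d.month = ?
    """, (state, year, month)).fetchone()
    return row[0]

# "How many tags in Florida for the month of June?"
florida_june_tags = tags_in_state_for_month("FL", 2010, 6)
print(florida_june_tags)
```

Note that the fact table holds no state or month columns of its own; those live only in the dimension tables and are reached through the keys.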

You'll probably also need summary tables. Summary tables contain the same columns as your fact tables, with the addition of one or more sums across different dimensions. As an example, the question "How many tags in Florida for the month of June?" can be answered much more quickly if you already have the sum of the tags for Florida (or, more properly, each facility in Florida) for the month (or each of the days) of June, 2010.

The period that you sum over depends on the mixture of queries that you expect. In your data warehouse, a day might be too short a period. In other words, if there are only a few fact rows per day, it's just as quick to compute the sum in SQL at query time as it is to select a precomputed daily summary row.
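A summary table could look something like the sketch below, again with Python's sqlite3. The sum_tag_month table, the monthly period, and the sample rows are assumptions for the example; pick the period that matches your own queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE fact_tag (tag_id INTEGER PRIMARY KEY, facility_id INTEGER, date_id INTEGER);
    -- Summary table: one row per facility per month, with the precomputed count.
    CREATE TABLE sum_tag_month (
        facility_id INTEGER,
        year        INTEGER,
        month       INTEGER,
        tag_count   INTEGER,
        PRIMARY KEY (facility_id, year, month)
    );
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20100605, 2010, 6), (20100620, 2010, 6), (20100702, 2010, 7)])
conn.executemany("INSERT INTO fact_tag VALUES (?, ?, ?)",
                 [(1, 1, 20100605), (2, 1, 20100620), (3, 2, 20100605), (4, 1, 20100702)])

# Rebuild the summary after each load of the fact table.
conn.execute("DELETE FROM sum_tag_month")
conn.execute("""
    INSERT INTO sum_tag_month (facility_id, year, month, tag_count)
    SELECT f.facility_id, d.year, d.month, COUNT(*)
    FROM fact_tag f
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY f.facility_id, d.year, d.month
""")

# Queries now read one summary row instead of scanning the facts.
summary_rows = conn.execute("""
    SELECT facility_id, year, month, tag_count
    FROM sum_tag_month
    ORDER BY facility_id, year, month
""").fetchall()
print(summary_rows)
```

Because facts are not updated once loaded, a full rebuild like this (or an append of the newest period) is usually enough to keep the summary in step.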

You'll need a calendar table too. A calendar table makes questions like, "How many annual (surprise) surveys happened in the 1st quarter of this year compared to (the 1st quarter of) last year?" much easier to query.
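Here is one way a calendar table might be built and used, sketched with Python's sqlite3 and datetime modules. The calendar and fact_survey tables, the survey_type values, and the sample surveys are assumptions for the example.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendar (
        cal_date TEXT PRIMARY KEY,   -- ISO date, e.g. '2010-02-14'
        year     INTEGER,
        quarter  INTEGER,
        month    INTEGER
    );
    CREATE TABLE fact_survey (
        survey_id   INTEGER PRIMARY KEY,
        survey_type TEXT,
        cal_date    TEXT REFERENCES calendar(cal_date)
    );
""")

# One calendar row per day for 2009 and 2010.
day = date(2009, 1, 1)
while day <= date(2010, 12, 31):
    quarter = (day.month - 1) // 3 + 1
    conn.execute("INSERT INTO calendar VALUES (?, ?, ?, ?)",
                 (day.isoformat(), day.year, quarter, day.month))
    day += timedelta(days=1)

# Hypothetical sample surveys.
conn.executemany("INSERT INTO fact_survey VALUES (?, ?, ?)", [
    (1, "annual", "2009-02-10"),
    (2, "annual", "2009-05-01"),
    (3, "annual", "2010-01-15"),
    (4, "annual", "2010-03-31"),
    (5, "complaint", "2010-02-01"),
])

# Annual surveys in the 1st quarter, per year: no date arithmetic in the query.
q1_counts = dict(conn.execute("""
    SELECT c.year, COUNT(*)
    FROM fact_survey s
    JOIN calendar c ON c.cal_date = s.cal_date
    WHERE s.survey_type = 'annual' AND c.quarter = 1
    GROUP BY c.year
""").fetchall())
print(q1_counts)
```

The quarter is computed once, when the calendar row is created, so fiscal quarters or holidays can be added later as extra columns without touching the queries.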

Gilbert Le Blanc 2010-07-09 15:50:36

Gilbert, wow, thank you so much for this, it's really cleared a lot of things up for me. I'm going to take some time this week to really dig into this, I'll try to share my findings if you don't mind :) Thanks! M@

Matt Cushing 2010-07-12 13:02:13

No, I don't mind, and you're welcome.

Gilbert Le Blanc 2010-07-12 13:24:45

I'm a little confused. The main piece here is the survey. Each survey can have tags if something bad happens. Would the survey information and tag information be separate fact tables? Let's take a step back and make it simpler - say I want to just track the number of surveys per facility. All that information is contained in the SurveyMain table. I would set up a calendar table, the survey fact, and a facility dim table, correct? The survey fact I would populate with the actual survey type instead of the fk to the other table, correct? Am I headed in the right direction?

Matt Cushing 2010-07-13 17:37:36

The tag information should be a separate fact table if tags are infrequently applied to surveys, or if a survey can have more than one tag. Your other assumptions are correct, and you're headed in the right direction.

Gilbert Le Blanc 2010-07-14 00:33:57

Answer 2

A:

It looks like you have multiple Fines, Files and Tags for each survey.

I would expect 4 fact tables. The facts in each look like they are largely datetime data, although these are often modelled as roles of a date and/or time dimension. I've made a couple of notes here; flags are generally going to be in dimensions:

SurveyMain

SurveyFine (wasImposed is in a dimension linked to this fact, totalFineAmt is a fact in this table)

SurveyFile

SurveyTag

They would all share a Survey dimension, and I would go ahead and share an Entity/Facility dimension in each one. You could snowflake through the Survey dimension, but that defeats the main benefit of star models: getting to all data directly instead of going through bridge tables.

You have the option of putting the survey type in its own dimension (or a junk dimension, perhaps) or having it accessed through the Survey dimension (not through a snowflake). That's typical with dimensional modeling - you don't need to follow your entities - you just need to avoid the too-many-dimensions and too-few-dimensions traps and watch the cardinality of your dimensions - especially if you've accidentally included some degenerate dimension like an invoice number, which changes with every fact and so needs to be stored in the fact table.

Actually, it's sometimes easier to do your star models by doing the typical joins in your 3NF which create typical flat reporting views and then simply taking those flat rows and turning them into stars. (That's how little relevance the entity-relationship model really has to the dimensional model). So you might join SurveyMain to SurveyTypes and SurveyFine on your current normalized keys and look at all the columns. This would be the basis for the SurveyFine fact table. Ditto for the other fact tables I identified. The shared stuff would be a candidate for shared dimensions. Entity is a good candidate for a conformed dimension (i.e. it's going to be shared between these survey models and other models related to your enterprise - like HR models or accounting models).
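To make the "flat rows into stars" step a bit more concrete, here is a minimal Python sketch. The flat rows, the column names, and the build_dimension helper are all made up for illustration; your real join of SurveyMain, SurveyTypes and SurveyFine will have different columns.

```python
# Flat reporting rows, as a join of SurveyMain, SurveyTypes and SurveyFine
# might produce them. All values here are hypothetical sample data.
flat_rows = [
    {"survey_id": 1, "survey_type": "annual", "facility": "Facility A", "total_fine_amt": 500.0},
    {"survey_id": 2, "survey_type": "complaint", "facility": "Facility A", "total_fine_amt": 0.0},
    {"survey_id": 3, "survey_type": "annual", "facility": "Facility B", "total_fine_amt": 250.0},
]

def build_dimension(rows, column):
    """Assign a surrogate key to each distinct value of `column`, in first-seen order."""
    keys = {}
    for row in rows:
        keys.setdefault(row[column], len(keys) + 1)
    return keys

# Shared columns become dimensions (value -> surrogate key).
dim_survey_type = build_dimension(flat_rows, "survey_type")
dim_facility = build_dimension(flat_rows, "facility")

# The fact rows keep the measure plus keys pointing at the dimensions.
fact_survey_fine = [
    {
        "survey_id": row["survey_id"],  # would key into a Survey dimension (not built here)
        "survey_type_key": dim_survey_type[row["survey_type"]],
        "facility_key": dim_facility[row["facility"]],
        "total_fine_amt": row["total_fine_amt"],
    }
    for row in flat_rows
]
```

In a real load you would do this in your ETL tool or in SQL rather than in dictionaries; the point is only that the dimensions fall out of the repeated columns of the flat view.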

Cade Roux 2010-07-13 18:26:02

Answer 3

A:

I would set up SurveyFines, SurveyTag and SurveyFiles fact tables. They are all different grains of facts, and each one represents the lowest grain.

They would all have date, Entity and Survey Dimensions with them.

I would then set up pre-aggregated metric tables for those metrics which might need to combine all three facts.

If you would like me to elaborate, feel free to ask. I'm in a bit of a rush today.

(continuing...) It would appear to me, that your users want to pivot the measurable data (number of files, date files were sent, sum of fines). They want to look at those metrics by attributes of the Survey. That's why I suggest a survey dimension.

Considering your comment below, I might then build a pre-aggregate metric table with these columns:

Date (the date I loaded the metric table), SurveyDimID, EntityDimID, NumTagsAssigned, NumFilesRequested, NumFilesReceived, NumFines, TotalFines, etc...

I would load this table every day with the full set of active survey data from my fact tables. This allows the users to go back and forth through history to see how the surveys came in.

I suppose at some point the entire survey process is complete. At that point, those records would no longer be included in the metric load. (They would remain in the facts.)
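A daily load along those lines might look like the sketch below, using Python's sqlite3. Everything in it is an assumption for illustration: the table names, the is_active flag on the survey dimension, and the idea that each file row is a requested file with a received flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_survey (survey_dim_id INTEGER PRIMARY KEY, entity_dim_id INTEGER, is_active INTEGER);
    CREATE TABLE fact_survey_tag  (survey_dim_id INTEGER);
    CREATE TABLE fact_survey_file (survey_dim_id INTEGER, received INTEGER);  -- 1 if received
    CREATE TABLE fact_survey_fine (survey_dim_id INTEGER, fine_amt REAL);
    CREATE TABLE metric_survey_daily (
        load_date TEXT, survey_dim_id INTEGER, entity_dim_id INTEGER,
        num_tags_assigned INTEGER, num_files_requested INTEGER,
        num_files_received INTEGER, num_fines INTEGER, total_fines REAL,
        PRIMARY KEY (load_date, survey_dim_id)
    );
""")

# Hypothetical sample data: survey 1 is still active, survey 2 is complete.
conn.executemany("INSERT INTO dim_survey VALUES (?, ?, ?)", [(1, 10, 1), (2, 11, 0)])
conn.executemany("INSERT INTO fact_survey_tag VALUES (?)", [(1,), (1,), (2,)])
conn.executemany("INSERT INTO fact_survey_file VALUES (?, ?)", [(1, 1), (1, 0), (2, 1)])
conn.executemany("INSERT INTO fact_survey_fine VALUES (?, ?)", [(1, 100.0), (1, 50.0)])

def load_metrics(load_date):
    """Snapshot every active survey. Each fact table is aggregated in its own
    subquery so that joining the three facts cannot multiply the counts."""
    conn.execute("""
        INSERT INTO metric_survey_daily
        SELECT ?, s.survey_dim_id, s.entity_dim_id,
               (SELECT COUNT(*) FROM fact_survey_tag t WHERE t.survey_dim_id = s.survey_dim_id),
               (SELECT COUNT(*) FROM fact_survey_file f WHERE f.survey_dim_id = s.survey_dim_id),
               (SELECT COALESCE(SUM(f.received), 0) FROM fact_survey_file f WHERE f.survey_dim_id = s.survey_dim_id),
               (SELECT COUNT(*) FROM fact_survey_fine n WHERE n.survey_dim_id = s.survey_dim_id),
               (SELECT COALESCE(SUM(n.fine_amt), 0) FROM fact_survey_fine n WHERE n.survey_dim_id = s.survey_dim_id)
        FROM dim_survey s
        WHERE s.is_active = 1
    """, (load_date,))

load_metrics("2010-07-13")
conn.execute("INSERT INTO fact_survey_tag VALUES (1)")  # a new tag arrives overnight
load_metrics("2010-07-14")

history = conn.execute("""
    SELECT load_date, num_tags_assigned, total_fines
    FROM metric_survey_daily ORDER BY load_date
""").fetchall()
print(history)
```

Each load adds one row per active survey, so comparing rows across load dates shows how a survey progressed over time.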

Markus 2010-07-13 20:30:32

So you're saying that the survey itself is a dimension? The way it works is a survey happens, tags are assessed, and fines assigned to those tags. Files are what is needed to finish the process. file1 is what the govt sends us. file2 is our response back to the govt. The main reason for part of this is to assess how long it takes to get the files, what is missing, and how long it takes to respond to the govt/state/county. I'd love to be able to give our people a good idea of what has happened historically. I think I can get state and federal data to compare it to eventually.

Matt Cushing 2010-07-13 21:16:53

Hi Matt, I updated the answer a bit. One of the metrics in the example table might be daysToRespond and so on.

Markus 2010-07-14 17:42:03

Answer 4

A:

This is quite a task for a support forum, so I will focus on just one part of the problem. It seems that one survey can consist of several visits, so I would suggest factSurveyVisit with a grain of one visit-event. The column SurveyID acts as a degenerate dimension in this model and is common to all visits from the same survey. The SurveyVisitSequenceID is a unique auto-increment (integer) and is used to simplify linking the two bridge tables for documents and tags to the fact table.

You could also promote a survey into a full dimension, dimSurvey, to add some notes etc.; use SurveyID as the link.

I did not tackle fines here. For this I would suggest a factFine table, which would have its own links to dimDate, dimTime, dimFacility, etc., so that reports regarding fines ($$) can be done fast without joining to most of the visit-related tables. There should also be a bridge table joining factFine to factSurveyVisit, provided fines are related to each visit and not to a completed survey.
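To show how a bridge table hangs off the visit fact, here is a small sketch in Python's sqlite3 covering only the tag bridge. The bridge table name (bridgeVisitTag), the TagKey and TagName columns, and the sample tag names are assumptions; factSurveyVisit, SurveyID, SurveyVisitSequenceID and dimTag follow the model described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dimTag (TagKey INTEGER PRIMARY KEY, TagName TEXT);
    -- Grain: one row per visit. SurveyID is a degenerate dimension.
    CREATE TABLE factSurveyVisit (
        SurveyVisitSequenceID INTEGER PRIMARY KEY AUTOINCREMENT,
        SurveyID              INTEGER NOT NULL
    );
    -- Bridge: resolves the many-to-many link between visits and tags.
    CREATE TABLE bridgeVisitTag (
        SurveyVisitSequenceID INTEGER REFERENCES factSurveyVisit(SurveyVisitSequenceID),
        TagKey                INTEGER REFERENCES dimTag(TagKey),
        PRIMARY KEY (SurveyVisitSequenceID, TagKey)
    );
""")

# Hypothetical sample data: survey 500 had two visits, survey 501 had one.
conn.executemany("INSERT INTO dimTag VALUES (?, ?)", [(1, "F-101"), (2, "F-202")])
conn.executemany("INSERT INTO factSurveyVisit (SurveyID) VALUES (?)", [(500,), (500,), (501,)])
conn.executemany("INSERT INTO bridgeVisitTag VALUES (?, ?)", [(1, 1), (1, 2), (2, 1), (3, 2)])

# Tags assessed per survey, rolled up across all of its visits.
tags_per_survey = dict(conn.execute("""
    SELECT v.SurveyID, COUNT(*)
    FROM factSurveyVisit v
    JOIN bridgeVisitTag b ON b.SurveyVisitSequenceID = v.SurveyVisitSequenceID
    GROUP BY v.SurveyID
""").fetchall())
print(tags_per_survey)
```

The single-column SurveyVisitSequenceID keeps the bridge narrow; without it, the bridge would have to carry the fact table's full composite key.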

[Diagram of the proposed model; image not available]

EDIT

Just noticed that your Tag table has date_cleared, so admittedly I do not understand the tagging in this business. In the model, dimTag is just a list of available tags. There may be one more factFacilityStatus table linking dimFacility and dimTag, tracking tag status for each facility.

Damir Sudarevic 2010-07-17 16:05:30

Damir, I agree completely. I wasn't so much looking for an answer as for keywords to understand or topics to wrap my brain around. I've never seen grain or bridge used, but I think I understand them as ways to bring data together. Going to pore over your info, thanks for the visuals!!

Matt Cushing 2010-07-19 13:00:53
