I'm tasked with creating a data warehouse for a client. The tables involved don't really follow the traditional examples out there (products/orders), so I need some help getting started. The client is essentially a processing center for cases (similar to legal cases). Each day, new cases are entered into the DB in the "cases" table; each column contains some bit of info related to the case. As a case is processed, additional one-to-many tables are populated with events related to it. There are quite a few of these event tables; examples might be case-open, case-dept1, case-dept2, case-dept3, etc. Each of these tables has a caseid that maps back to the "cases" table. There are also a few lookup tables involved.
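
For illustration, the shape of the source schema is roughly this (names are simplified, and the real tables have many more columns):

    # Illustrative sketch only -- the real schema has many more columns and event tables.
    from datetime import datetime

    # Parent table: one row per case.
    cases = [
        {"caseid": 1, "region_id": 10, "received": datetime(2010, 1, 4, 9, 0)},
        {"caseid": 2, "region_id": 20, "received": datetime(2010, 1, 4, 9, 30)},
    ]

    # One of several per-stage event tables, one-to-many back to cases via caseid.
    case_dept1 = [
        {"caseid": 1, "event_time": datetime(2010, 1, 4, 11, 0), "clerk_id": 7},
        {"caseid": 1, "event_time": datetime(2010, 1, 5, 10, 0), "clerk_id": 7},
        {"caseid": 2, "event_time": datetime(2010, 1, 4, 14, 0), "clerk_id": 8},
    ]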

Currently, the reporting needs relate to exposing bottlenecks in the various stages and the granularity is at the hour level for certain areas of the process.

I may be asking too much here, but I'm looking for some direction as to how I should set up my Dim and Fact tables, or any other suggestions you might have.

+4  A: 

The fact table is the case event and it is 'factless' in that it has no numerical value. The dimensions would be time, event type, case and maybe some others depending on what other data is in the system.

You need to consolidate the event tables into a single fact table, labelled with an 'event type' dimension. The throughput/bottleneck reports then calculate differences between event times for specific combinations of event types on a given case.

The reports should calculate the event-event times and possibly bin them into a histogram. You could also label certain types of event combinations and apply the label to the events of interest. These events could then have the time recorded against them, which would allow slice-and-dice operations on the times with an OLAP tool.
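
A rough sketch of that approach, assuming each source event table carries a caseid and a timestamp (pandas is used purely to show the shape of the calculation; table and column names are illustrative):

    import pandas as pd

    # Illustrative per-stage event tables as they might come out of the source DB.
    case_open  = pd.DataFrame({"caseid": [1, 2],
                               "event_time": pd.to_datetime(["2010-01-04 09:00", "2010-01-04 09:30"])})
    case_dept1 = pd.DataFrame({"caseid": [1, 2],
                               "event_time": pd.to_datetime(["2010-01-04 11:00", "2010-01-04 14:00"])})

    # Consolidate into a single factless event table, labelled by event type.
    fact_event = pd.concat([case_open.assign(event_type="case_open"),
                            case_dept1.assign(event_type="dept1")], ignore_index=True)

    # Event-to-event elapsed time for one combination of event types, per case.
    by_case = fact_event.pivot(index="caseid", columns="event_type", values="event_time")
    lag_hours = (by_case["dept1"] - by_case["case_open"]).dt.total_seconds() / 3600

    # Bin into a histogram at hour granularity to expose where cases pile up.
    print(pd.cut(lag_hours, bins=[0, 1, 2, 4, 8, 24]).value_counts().sort_index())

In the warehouse itself the same calculation would go through the time and event-type dimension keys rather than raw timestamps, but the shape of it is the same.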

If you want to benchmark certain stages in the life-cycle progression, you would have a table that goes: case type, event type 1, event type 2, benchmark time.

With a bit of massaging, you might be able to use a data mining toolkit or even a simple regression analysis to spot correlations between case attributes and event-event times (YMMV).
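
A minimal sketch of the regression idea, with made-up case attributes (a real pass would use a proper data mining toolkit, as suggested):

    import numpy as np

    # Hypothetical per-case data: an attribute (e.g. number of documents filed)
    # and the observed open-to-dept1 elapsed time in hours.
    docs_filed = np.array([3.0, 5.0, 8.0, 2.0, 9.0, 6.0])
    lag_hours  = np.array([2.1, 3.0, 5.2, 1.8, 6.1, 3.9])

    # Simple linear fit: does the attribute explain any of the event-to-event time?
    slope, intercept = np.polyfit(docs_filed, lag_hours, 1)
    r = np.corrcoef(docs_filed, lag_hours)[0, 1]
    print(f"lag ~ {slope:.2f} * docs + {intercept:.2f}  (r = {r:.2f})")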

ConcernedOfTunbridgeWells
Hey Nigel.. any chance I can ping you with what I came up with? Drop me an email if you'd be willing: <myusername>@gmail.com. Thanks BTW.
madcolor
@madcolor: post it for review as yet another question, if it isn't too proprietary.
S.Lott
@madcolor, @S.Lott: I second that. I did do some reports of this nature at one point, but this sort of date-to-date analysis in a data warehouse is probably of more general interest.
ConcernedOfTunbridgeWells
@ConcernedOfTunbridgeW, @madcolor: Your designs are right on. You should also look into adding an accumulating snapshot fact table. These tables are used for quickly calculating and reporting lag time between events. Chris Adamson has several articles discussing them: http://blog.oaktonsoftware.com/2007/03/accumulating-snapshot-use-accumulating.html
Jamison
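
A minimal sketch of the accumulating snapshot Jamison mentions: one row per case, one timestamp column per milestone, updated in place as the case progresses (column names are illustrative):

    import pandas as pd

    # One row per case; milestone columns start out empty (NaT) and are filled in over time.
    snapshot = pd.DataFrame({
        "caseid":    [1, 2],
        "case_open": pd.to_datetime(["2010-01-04 09:00", "2010-01-04 09:30"]),
        "dept1":     pd.to_datetime([None, None]),
    }).set_index("caseid")

    # When case 1 reaches the dept1 milestone, its single row is updated in place
    # and the lag measure is recomputed -- which is what makes the fact "accumulating".
    snapshot.loc[1, "dept1"] = pd.Timestamp("2010-01-04 11:00")
    snapshot["open_to_dept1_hours"] = (snapshot["dept1"] - snapshot["case_open"]).dt.total_seconds() / 3600
    print(snapshot)
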
+2  A: 

I suggest you check out Kimball's books, particularly this one, which should have some examples to get you thinking about applications to your problem domain.

In any case, you need to decide whether a dimensional model is even appropriate. It is quite possible to treat a 3NF database as an 'enterprise data warehouse', with different indexes or summaries, or whatever.

Without seeing your current schema, it's REALLY hard to say. It sounds like you will end up with several star models with some conformed dimensions tying them together, so you might have a case dimension as one of your conformed dimensions. The facts from each of the other tables would go into fact tables that link both to the conformed dimensions and to any other dimensions appropriate to those facts. For instance, if there is an employee id in case-open, the case-open fact table would link to a conformed employee dimension. That conformed dimension might then be referenced several times, from several of your subsidiary fact tables.
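
A toy illustration of that conformed-dimension point, with hypothetical keys (the only thing it shows is that separate fact tables share the same case and employee dimensions):

    import pandas as pd

    # Conformed dimensions, shared by every star in the warehouse.
    dim_case     = pd.DataFrame({"case_key": [1, 2], "region": ["North", "South"]})
    dim_employee = pd.DataFrame({"employee_key": [7, 8], "dept": ["Dept 1", "Dept 2"]})

    # Two subsidiary fact tables, each carrying keys into the same conformed dimensions.
    fact_case_open = pd.DataFrame({"case_key": [1, 2], "employee_key": [7, 8],
                                   "open_time": pd.to_datetime(["2010-01-04 09:00", "2010-01-04 09:30"])})
    fact_dept1     = pd.DataFrame({"case_key": [1], "employee_key": [7],
                                   "event_time": pd.to_datetime(["2010-01-04 11:00"])})

    # Because the dimensions are conformed, results from different stars line up on the same keys.
    print(fact_case_open.merge(dim_case, on="case_key").merge(dim_employee, on="employee_key"))
    print(fact_dept1.merge(dim_case, on="case_key").merge(dim_employee, on="employee_key"))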

Kimball's modeling method is fairly straightforward, and can be followed like a recipe. You need to start by identifying all your facts, grouping them into fact tables, identifying individual dimensions on each fact table and then grouping them as appropriate into dimension tables, and identifying the type of each dimension.

Cade Roux
A: 

Here's what I came up with essentially. Thx NXC

Fact Events:  EventID, TimeKey, CaseID

Dim Events:   EventID, EventDesc

Dim Time:     TimeKey

Dim Regions:  RegionID, RegionDesc

Cases:        CaseID, RegionID

madcolor
Fact table should be one row per event. There should possibly be a case dimension, although this could be a rollup on the event dimension. You are looking to measure time between individual events, so you shouldn't be rolling the events up.
ConcernedOfTunbridgeWells
You're measuring the time between a single event of a certain type and the most recent previous event of some other type that marks a stage in the case's life cycle. For example, registrar's list to first appearance.
ConcernedOfTunbridgeWells
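
A sketch of that calculation under the same assumptions as the earlier snippets: for each later event, find the most recent earlier event of the other type on the same case (pd.merge_asof does the backwards-in-time match; event names are just examples):

    import pandas as pd

    events = pd.DataFrame({
        "caseid":     [1, 1, 1, 2, 2],
        "event_type": ["registrars_list", "registrars_list", "first_appearance",
                       "registrars_list", "first_appearance"],
        "event_time": pd.to_datetime(["2010-01-04 09:00", "2010-01-05 09:00", "2010-01-05 15:00",
                                      "2010-01-04 10:00", "2010-01-06 10:00"]),
    })

    ends   = events.loc[events.event_type == "first_appearance", ["caseid", "event_time"]]
    starts = events.loc[events.event_type == "registrars_list", ["caseid", "event_time"]]
    starts = starts.assign(start_time=starts["event_time"])

    # For each first appearance, match the latest registrar's-list event at or before it on the same case.
    paired = pd.merge_asof(ends.sort_values("event_time"), starts.sort_values("event_time"),
                           on="event_time", by="caseid", direction="backward")
    paired["lag_hours"] = (paired["event_time"] - paired["start_time"]).dt.total_seconds() / 3600
    print(paired)
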
A: 

This may be a case of choosing a solution before you've considered the problem. Not all data warehouses fit the star schema model. I don't see that you are aggregating any data here. So far we have a factless fact table and at least one rapidly changing dimension (cases).

Looking at what I see so far, I think the central entity in this database should be the case. Trying to put the event in the middle doesn't seem right. Try looking at it a different way: perhaps case, events, and case-events to start.

Brad_Z
+1  A: 

Like any other facet of development, you must approach the problem from the end requirements ("user stories" if you will) backwards. The most conservative approach for a warehouse is to simply represent a copy of the transaction database. From there, guided by the requirements, certain optimizations can be made to enhance the performance of specific data access patterns. I believe it is important, however, to see these as optimizations and not assume that a data warehouse automatically must be a complex explosion of every possible dimension over every fact. My experience is that a straight representation is adequate or even ideal for 90+% of analytical queries.

For the remainder, first consider indexes, indexed views, additional statistics, or other optimizations that can be made without affecting the structures. Then, if aggregation or other redundant structures are needed to improve performance, consider separating these into a "data mart" (at least conceptually), which provides a separation between primitive facts and redundancies thereof. Finally, if the requirements are too fluid and the aggregation demands too heavy to function efficiently this way, then you might consider wholesale explosions of data, i.e. a star schema. Even then, limit this to the smallest cross-section of the data possible.

N8allan