views:

1602

answers:

3

THANKS: Both answers here are very helpful, but I could only pick one. I really appreciate the advice!

our datawarehouse will be used more for workflow reports than traditional analytical reports. Our users care about "current picture" far more than history. (though history matters, too.) We are a government entity that does not have costs or related calculations. Mostly just counts of people within given locations and with related history.

We are using Oracle, and I have found distinct advantage in using the star join whenever possible and would like to rearchitect everything to as closely resemble the star schema as is reasonable for our business uses. Speed in this DW is vital, and a number of tests have already proven the star schema approach to me.

Our "person" table is key - it contains over 4 million records and will be the most frequently used source in queries. It can be seen at the center of a star with multiple dimensions (like age, gender, affiliation, location, etc.). It is a very LONG table, particularly when I join it to the address and contact information.

However, it is more like a dimension table when we start looking at history. For example, there are two different history tables that have a person key pointing to the person table. One has over 20 million records and the other has almost 50 million and grows daily.

Is this table a fact table or a dimension table? Can one work as both? If so, is that going to be a big performance problem? Is it common to query more off of a dimension than a fact? What happens if a DIFFERENT fact table that uses the person table as a dimension is actually only 60,000 records (much smaller.).

I think my problem is that our data and use of it does not fit with the commonly use examples of star schemas.

CLARIFICATION: Some good thoughts have been added below, but perhaps I left too much out to really explain well. Here's some more info:

We handle a voter database. We don't have any measures except voter counts by various groups: voter counts by party, by age, by location; voter counts by ballot type and election, by ballot status and election, etc. We do have a "voting history" log as well as an activity audit log (change of address, party, etc.). We have information on which voters are election workers and all that related information. I figure I'll get to the peripheral stuff later.

For now I'm focusing on our two major "business processes": voter registration(which IS a voter.) and election turnout. In the first, voter is a fact. In the second, voter is a dimension, along with party, election, and type of ballot. (and in case anyone is worried - no we don't know HOW people vote. Just that they do. LOL )

I hope that clarifies things a bit.

+1  A: 

If possible, my suggestion would be to refactor these tables so they are more in alignment with a true star schema. Although 50 million records sounds like a lot (when thinking about a transactional system) we have multiple fact tables with as many as 500 million rows. Assuming that your hardware was speced for this type of work, you should not have any issues with combining your tables into one large fact table (assuming they are all within the same subject area).

With that said, make sure that you account for the other factors that should be considered when choosing a highly denormalized structure. The star schema is a great design for reporting on data because of the reduction in the necessary joins, however, you often pay a large price for this while updating tables and in disk space. When you say that you are considering using this schema for more of a workflow application, rather than mostly analytics, then I would make sure to account for the updates. Are updates needed in real time or near real time? If so, again you may not want to consider a star.

Finally, yes in some cases we query only our dimension tables, often when an application needs a specific list of items (i.e. products, customers, etc), this is a valid use, however, a better solution would likely leverage an ODS rather than our star schema.

What I have found is as much as I try to make my schema look like something right out of an Inmon or Kimball textbook, it almost never works without some real world custimization.

Edit I sure have been more specific with the reference to the ODS.

An operational data store (or "ODS") is a database designed to integrate data from multiple sources to make analysis and reporting easier. Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the Data warehouse generally on a less frequent basis.

According to Bill Inmon, the originator of the concept, an ODS is "a subject-oriented, integrated, volatile, current-valued, detailed-only collection of data in support of an organization's need for up-to-the-second, operational, integrated, collective information."

ODS differ from Inmon's definition of enterprise data warehouse by having a limited history, and more frequent update than an EDW. In practice ODS tend to be more reflective of source structures in order to speed implementations and provide a truer representation of production data.

http://en.wikipedia.org/wiki/Operational%5Fdata%5Fstore

Irwin M. Fletcher
thanks for the ideas - I am trying to refactor, which is where these questions come from. To answer your questions - there are no updates. The tables are dropped and rebuilt from production nightly. (long run, I want materialized views.). I guess my thought then, is that if I make this a true star, I will query mostly off of the person dimension, and this one dimension will snowflake (with like 10 dimensions). a major business "fact" is registrations, which has a one-to-one relationship with "persons". it's puzzling me a bit.
sql_mommy
question - what's an ODS? I'm unfamiliar with that acronym.
sql_mommy
Edited response to add more clarity.
Irwin M. Fletcher
+1  A: 

Large "people" (customer) dimensions are frequent in telecom, banking, insurance etc. Kimball has a section named "Large Changing Customer Dimensions" under CRM chapter (6). It shows how to create "minidimensions". Frequently changing or frequently analyzed attributes (columns) are broken-off into separate mini-dimension tables. These mini-dimensions are connected via fact table, so fact table has a FK for each of these tables separately.

It seems to me that your example is close to this.

As a general rule, dimension table is a look-up table for objects which rarely change (people, accounts, time, products, stores) and fact table captures activity (history) of interactions between these objects. Fact table contains measures that you would want to aggregate (total sales, number of hours worked, number of parts produced, etc..).

AFTER CLARIFICATION:
I would say that Voter is actually a conformed dimension -- common for all data marts (business processes). Other conformed dimensions would be: Date, Party, Elections, VotingStations. Mini-dimensions would be Demographic and GeoArea. Fact tables would be: RegistrationEvent (who when and where registered) and ElectionEvent (who when and where voted in which election, using what).
Dimension Voter and fact RegistrationEvent are loaded from operational systems which capture voter registration and other changes.
This is simplified, but I hope it captures the basic idea.

Damir Sudarevic
good thought - I'll check out the book. I have one of Kimball's, but not the one you are referring to. I'm going to edit my post to clarify my issue a bit, though, because we don't really have anything to measure.
sql_mommy
I think this makes sense for a lot of what I'm working on. I guess my question here would be - is there a problem with a dimension table that is the same length as the fact table. i.e., fact = registrations (with party, location, voterID, etc), while the voter dimension would be exactly the same length.
sql_mommy
The book is "The Data Warehouse Toolkit".Dimension and facts are just wording. You should try to arrange table layout to support your business requirements (reporting, analytics) as efficiently as possible..
Damir Sudarevic
A: 

ok - this isn't a full "answer", but it's close.

Notice this blog entry describing a Kimball lecture: http://database-geek.com/2005/03/28/a-day-with-ralph-kimball-part-2/

The reason I'm struggling is that this is a "degenerate" dimension. My voter regnum and associated information is one to one with my "registration" fact table. So it appears that it's even ok with Kimball to throw that into the fact table.

So now I'm just looking into what happens when a fact table is used by another fact table.

EDIT: Also, I have found googling the term "monster dimension" to be very helpful. This is much like a slowly changing customer dimension. As long as I am willing to snowflake, I can achieve what I need - star joins when querying voter, and not causing problems to use voter as a dimension for various fact tables.

EDIT: Here was my final conclusion: As advised above, the point is to facilitate business process, not to fit the textbook diagram.

Our business is such that there is absolutely no reason to split apart the voter table (having a fact table for "registrations" and a dimension for "voters") - when querying with that table we will want all the attributes as well as all the flags and text information. I would not want to break the attributes out separately into the "fact" (like Kimball's book shows for customers and orders) because those attributes mean something different when attached to the facts as than when they are attached to the dimensions. Further, voters is used as an attribute in multiple other places, some of which DO fit a traditional star.

My main purpose is SPEED. So I chose a modified format - a lot like the snowflake - where voter is the center of multiple tables and oracle can use the star join when I index everything right. Then, I use voter as a dimension in all my other "stars". In every case, I set it up so that most if not all tables can be joined using the star join, even though it isn't "textbook."

Thanks again for the help!

sql_mommy