I'm developing a data warehouse and have come up against a problem I'm not sure how to fix. The current schema is defined below:

DimInstructor <- Dimension table for instructors
DimStudent <- Dimension table for students

If an instructor's details change in my OLTP database, I want to add a new record to the DimInstructor table so that the history is preserved for reporting.

I now want to create a lesson dimension table called DimLesson, and in it I want a reference to the instructor.

The DimInstructor table contains:

InstructorDWID <- Identity field, assigned when the row is entered into the DW
InstructorID <- The instructor ID that has come from the OLTP database

Now, I can't make InstructorID a primary key because it isn't guaranteed to be unique (if the instructor changes their name, there will be 2 records in the DW with the same InstructorID value).

So my question is, how do I reference the instructor from DimLesson? Do I use the InstructorDWID? If so, and an instructor has 2 entries in DimInstructor, queries become more complicated when I want to look at all lessons by a specific instructor.

Any help would be appreciated!

A: 

Use a GUID/UUID as the primary key, or a combination of columns.

Don
You mean the InstructorDWID? This value will be unique as it is an identity column. However, if the instructor's details change, that instructor will have more than one InstructorDWID. Example: InstructorDWID is currently 1, then the instructor changes her title from Miss to Mrs. We now have InstructorDWIDs 1 and 2, where 1 is obsolete and 2 is current. What happens to the lessons that reference InstructorDWID 1 now that it is obsolete?
Paul
A: 

Paul,

There are multiple ways you can handle this. You can use an effective date/expiration date pair, a sequence number or a version number to differentiate the records that share the same InstructorID.

A dimension table that captures all the relevant details would look like this:

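create table DIM_INSTRUCTOR(
  instr_guid      number,        -- surrogate key, populated through a sequence (composite PK, part 1)
  instr_oid       number,        -- id taken directly from the OLTP system (composite PK, part 2)
  instr_name      varchar2(100), -- instructor name
  other_attr      varchar2(25),
  eff_date        date,          -- date this version of the row became effective
  expiration_date date,          -- date this version of the row was superseded
  constraint pk_dim_instructor primary key (instr_guid, instr_oid)
);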

instr_guid is directly generated from a sequence and is independent of the OLTP system.

This would let you capture all the details for a given instructor. You can use just the instr_guid as the foreign key in the fact table, but including both of them (instr_guid, instr_oid) would make querying easier, which is one of the goals of data warehousing.
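
To illustrate why carrying both keys helps, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The fact_booking table, its columns and all the sample rows are hypothetical and only there to make the example runnable. Filtering on instr_oid finds every booking for an instructor across all versions without a join, while joining on instr_guid returns the instructor details as they were at booking time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_instructor (
    instr_guid      INTEGER NOT NULL,   -- surrogate key (one per version)
    instr_oid       INTEGER NOT NULL,   -- id from the OLTP system
    instr_name      TEXT,
    eff_date        TEXT,
    expiration_date TEXT,
    PRIMARY KEY (instr_guid, instr_oid)
);
-- Hypothetical fact table that carries both parts of the key.
CREATE TABLE fact_booking (
    booking_id   INTEGER PRIMARY KEY,
    instr_guid   INTEGER NOT NULL,
    instr_oid    INTEGER NOT NULL,
    booking_date TEXT,
    FOREIGN KEY (instr_guid, instr_oid)
        REFERENCES dim_instructor (instr_guid, instr_oid)
);
-- Made-up sample data: instructor 100 has two versions.
INSERT INTO dim_instructor VALUES
    (1, 100, 'Miss Jane Doe',  '2000-03-11', '2010-08-12'),
    (2, 100, 'Mrs Jane Smith', '2010-08-12', '9999-12-31'),
    (3, 200, 'Mr John Roe',    '2005-01-01', '9999-12-31');
INSERT INTO fact_booking VALUES
    (1, 1, 100, '2009-05-01'),
    (2, 2, 100, '2011-02-14'),
    (3, 3, 200, '2011-02-15');
""")

# Every booking for OLTP instructor 100, whatever the version: no join needed.
all_versions = conn.execute(
    "SELECT booking_id FROM fact_booking WHERE instr_oid = ? ORDER BY booking_id",
    (100,),
).fetchall()

# The same bookings with the name that was valid when each one was made.
as_booked = conn.execute(
    """
    SELECT f.booking_id, d.instr_name
    FROM fact_booking f
    JOIN dim_instructor d
      ON d.instr_guid = f.instr_guid AND d.instr_oid = f.instr_oid
    WHERE f.instr_oid = ?
    ORDER BY f.booking_id
    """,
    (100,),
).fetchall()

print(all_versions)  # [(1,), (2,)]
print(as_booked)     # [(1, 'Miss Jane Doe'), (2, 'Mrs Jane Smith')]
```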

Useful Links:

http://en.wikipedia.org/wiki/Surrogate_key
http://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2

Rajesh
Thanks for that. How would I go about referencing the key from another dimension table? So the DimLessons table contains all lessons for a particular instructor. The lessons table functions in the same way, using Type 2.
Paul
Dimension tables are (generally) not supposed to reference each other. They are all independent entities, and it is the fact table that references them. From what I understand, your scenario would have a class enrollment at the fact level, so each class enrollment would be a record in the fact table. Students_dim, instructors_dim and classes_dim will contain the corresponding attributes. The enrollment_fact would contain keys from each of these tables plus all the other details, like enrollment_date and so on.
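
A rough sketch of that layout, run through Python's built-in sqlite3 module so it can be tried as-is. The table names come from the description above; the key and attribute column names and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Independent dimensions: none of them references another.
CREATE TABLE students_dim    (student_key    INTEGER PRIMARY KEY, student_name    TEXT);
CREATE TABLE instructors_dim (instructor_key INTEGER PRIMARY KEY, instructor_name TEXT);
CREATE TABLE classes_dim     (class_key      INTEGER PRIMARY KEY, class_name      TEXT);

-- The fact table is the only place where the dimensions meet.
CREATE TABLE enrollment_fact (
    student_key     INTEGER NOT NULL REFERENCES students_dim (student_key),
    instructor_key  INTEGER NOT NULL REFERENCES instructors_dim (instructor_key),
    class_key       INTEGER NOT NULL REFERENCES classes_dim (class_key),
    enrollment_date TEXT    NOT NULL
);

-- Made-up sample data.
INSERT INTO students_dim    VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO instructors_dim VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO classes_dim     VALUES (1, 'Algebra'), (2, 'Biology');
INSERT INTO enrollment_fact VALUES
    (1, 1, 1, '2011-01-10'),
    (2, 1, 1, '2011-01-11'),
    (2, 2, 2, '2011-01-12');
""")

# Enrollments per instructor and class, resolved through the fact table.
rows = conn.execute(
    """
    SELECT i.instructor_name, c.class_name, COUNT(*) AS enrollments
    FROM enrollment_fact f
    JOIN instructors_dim i ON i.instructor_key = f.instructor_key
    JOIN classes_dim     c ON c.class_key      = f.class_key
    GROUP BY i.instructor_name, c.class_name
    ORDER BY i.instructor_name, c.class_name
    """
).fetchall()

print(rows)  # [('Jane Doe', 'Algebra', 2), ('John Roe', 'Biology', 1)]
```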
Rajesh
I think I understand. So if I want to create the schema based on Instructors, Students, Lessons and Lesson Bookings, each of the dim tables (instructors, students, lessons) would be independent of the others and link via the fact table? That would make sense, but what if a report needed to display lessons by an instructor which nobody has attended? How would I link an instructor to a lesson if there are no records in the fact table because nobody attended?
Paul
A: 

What you are describing here is usually called a type 2 dimension. The Kimball data warehouse books have whole sections on type 2 dimensions and the ETL for them, and they are worth reading.

The first thing to understand is the difference between the primary key and the business key. The primary key uniquely identifies a row in the table, while the business key uniquely identifies an entity that the table describes, like an instructor. For example, if an instructor changes name, the dimInstructor table may look something like:

InstructorKey  InstructorBusinessKey  FirstName LastName  row_ValidFrom row_ValidTo   row_Status
  1234           jane_doe_7211           Jane     Doe       2000-03-11   2010-08-12     expired
  7268           jane_doe_7211           Jane     Smith     2010-08-12   3000-01-01     current

Now, provided that dimLesson is the proper design for your business model (as opposed to having some kind of fact table), dimLesson would have a column called InstructorKey. During the ETL process, when delivering the new row (7268) to the dimInstructor table, replace all references to row 1234 in dimLesson with 7268.
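
The ETL step could be sketched roughly as follows, using Python's built-in sqlite3 module so it runs as-is. The dimInstructor columns follow the table above; the dimLesson columns other than InstructorKey, the apply_name_change helper and the sample lessons are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimInstructor (
    InstructorKey         INTEGER PRIMARY KEY,  -- primary (surrogate) key
    InstructorBusinessKey TEXT NOT NULL,        -- identifies the instructor
    FirstName TEXT, LastName TEXT,
    row_ValidFrom TEXT, row_ValidTo TEXT, row_Status TEXT
);
CREATE TABLE dimLesson (
    LessonKey     INTEGER PRIMARY KEY,          -- hypothetical column
    LessonName    TEXT,                         -- hypothetical column
    InstructorKey INTEGER REFERENCES dimInstructor (InstructorKey)
);
INSERT INTO dimInstructor VALUES
    (1234, 'jane_doe_7211', 'Jane', 'Doe', '2000-03-11', '3000-01-01', 'current');
INSERT INTO dimLesson VALUES (1, 'Intro', 1234), (2, 'Advanced', 1234);
""")

def apply_name_change(conn, business_key, new_key, new_last_name, change_date):
    """Expire the current row, add the new version, repoint dimLesson."""
    with conn:  # one transaction: commit on success, roll back on error
        old_key, first_name = conn.execute(
            "SELECT InstructorKey, FirstName FROM dimInstructor "
            "WHERE InstructorBusinessKey = ? AND row_Status = 'current'",
            (business_key,),
        ).fetchone()
        # 1. Close off the old version.
        conn.execute(
            "UPDATE dimInstructor SET row_ValidTo = ?, row_Status = 'expired' "
            "WHERE InstructorKey = ?",
            (change_date, old_key),
        )
        # 2. Deliver the new version with a new primary key.
        conn.execute(
            "INSERT INTO dimInstructor VALUES (?, ?, ?, ?, ?, '3000-01-01', 'current')",
            (new_key, business_key, first_name, new_last_name, change_date),
        )
        # 3. Replace references to the old row in dimLesson.
        conn.execute(
            "UPDATE dimLesson SET InstructorKey = ? WHERE InstructorKey = ?",
            (new_key, old_key),
        )

apply_name_change(conn, "jane_doe_7211", 7268, "Smith", "2010-08-12")

lesson_keys = conn.execute(
    "SELECT InstructorKey FROM dimLesson ORDER BY LessonKey"
).fetchall()
old_row = conn.execute(
    "SELECT row_ValidTo, row_Status FROM dimInstructor WHERE InstructorKey = 1234"
).fetchone()
print(lesson_keys)  # [(7268,), (7268,)]
print(old_row)      # ('2010-08-12', 'expired')
```

Note that overwriting dimLesson.InstructorKey in place means dimLesson no longer records which instructor version a lesson used to point to.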

Damir Sudarevic
Thanks Damir. The dimLesson table will be designed similarly to the dimInstructor table. An example report might ask: did lesson bookings increase or decrease after the lesson changed names? I think the method you explained makes the most sense.
Paul