views:

11

answers:

1

I'm trying to use Pentaho Data Integration / Pan to make a type-2 (SCD) table.

After I ran my data through a DimensionLookup step a first time, I got a type-2 table that looks like:

PKID  version  date_from             date_to                MyPK  Field1  Field2  ...
150   1        1900-01-01 00:00:00   2199-12-31 23:59:59    1     "X"     "Y"     ...

So far so good. Then later I got new data, updated my original table to it, and ran it through the same DimensionLookup step, and all the modified records looked like:

PKID  version  date_from             date_to                MyPK  Field1  Field2  ...
150   1        1900-01-01 00:00:00   2010-06-24 03:45:00    1     "X"     "Y"     ...
151   2        2010-06-24 03:45:00   2199-12-31 23:59:59    1     "X2"    "YYY"   ...

That looks good, too. But for records that were added to my dataset since last time, they get added with date_from='1900-01-01 00:00:00'. (And come to think of it, it's a bit strange that my initial load used date_from='1900-01-01 00:00:00', too.)

I'm having it use the default start date, which claims to be the system's current datetime, though I've also tried "start_trans" ("Start date of transformation"), which did the same thing except using a timestamp 9 hours earlier (on 1899-12-31).

I can't see how using 1900 for all new records makes sense: you can't tell what was added when, since it looks like every new record has existed since the start of time. I even looked up type-2 in Kimball's book, and while it wasn't super explicit, it seemed to indicate that newly inserted values should have as their date_from the time they start being valid.

Is this a bug in PDI? Is there a workaround? Am I just using it wrong? Am I misunderstanding the whole point of type-2?

Thanks!

A: 

You want to add into the transformation a new field or variable with the current system time.

There are several ways to do it, so I'll just suggest one. Add a new transformation at the beginning of the job called set_current_time. This transformation will get the current system time (you can get it from the database using a Table Input step). Then you use Set Variable to set the current time as a variable that will be available throughout the job.

From here, it's fairly easy to find if you need to use the variable or the previously available date as the date_from field, using a combination of a lookup and a formula/script.

OmerGertel

related questions