I'm trying to use Pentaho Data Integration / Pan to make a type-2 (SCD) table.
After I ran my data through a DimensionLookup step a first time, I got a type-2 table that looks like:
PKID version date_from date_to MyPK Field1 Field2 ...
150 1 1900-01-01 00:00:00 2199-12-31 23:59:59 1 "X" "Y" ...
So far so good. Then later I got new data, updated my original table to it, and ran it through the same DimensionLookup step, and all the modified records looked like:
PKID version date_from date_to MyPK Field1 Field2 ...
150 1 1900-01-01 00:00:00 2010-06-24 03:45:00 1 "X" "Y" ...
151 2 2010-06-24 03:45:00 2199-12-31 23:59:59 1 "X2" "YYY" ...
That looks good, too. But for records that were added to my dataset since last time, they get added with date_from='1900-01-01 00:00:00'. (And come to think of it, it's a bit strange that my initial load used date_from='1900-01-01 00:00:00', too.)
I'm having it use the default start date, which claims to be the system's current datetime, though I've also tried "start_trans" ("Start date of transformation"), which did the same thing except using a timestamp 9 hours earlier (on 1899-12-31).
I can't see how using 1900 for all new records makes sense: you can't tell what was added when, since it looks like every new record has existed since the start of time. I even looked up type-2 in Kimball's book, and while it wasn't super explicit, it seemed to indicate that newly inserted values should have as their date_from the time they start being valid.
Is this a bug in PDI? Is there a workaround? Am I just using it wrong? Am I misunderstanding the whole point of type-2?
Thanks!