Hello,

I need to load dimensions from EDW tables (which maintain historical records) that are stored in a key-value-parameter format.

My scenario is fine if I get records in the EDW like the following:

Key1  Key2  Code  Value  EffectiveDate        EndDate     CurrentFlag
100   555   01    AAA    2010-01-01 11:00:00  9999-12-31  Y
100   555   02    BBB    2010-01-01 11:00:00  9999-12-31  Y

This needs to be loaded into the DM by pivoting it.

The combination of Key1 and Key2 makes the natural key (NK) for the DM:

SK  NK       01   02   EffectiveDate        EndDate     CurrentFlag
1   100-555  AAA  BBB  2010-01-01 11:00:00  9999-12-31  Y
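
(For reference, this pivot is equivalent to a conditional aggregation in T-SQL; a sketch with made-up table names, since the real pivoting happens in the SSIS Pivot transform:)

    SELECT
        CAST(Key1 AS varchar(10)) + '-' + CAST(Key2 AS varchar(10)) AS NK,
        MAX(CASE WHEN Code = '01' THEN Value END) AS [01],
        MAX(CASE WHEN Code = '02' THEN Value END) AS [02],
        EffectiveDate, EndDate, CurrentFlag
    FROM dbo.EDW_KeyValue   -- hypothetical EDW source table
    GROUP BY Key1, Key2, EffectiveDate, EndDate, CurrentFlag;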

My SSIS package does all the pivoting fine. It looks up the incoming NK in the DIM: if it's new, it inserts; otherwise a further lookup on the effective date determines whether the incoming row for the same natural key has a change in any attribute. If so, it expires the current record by setting its end date and inserts a new record with the new attribute value, pulling the most recent record's values for the other attributes.

My problem is that if the same natural key comes twice (with the same attribute) in a single extract, my first lookup, which is on the natural key, lets both records pass and tries to insert both, which fails. If I take distinct records on the NK, the second record is not picked up and I need to run the package again.

So my question is: how can I configure the lookup, or what alternative approach would handle this scenario, so that when the same NK comes twice in a single extract, the first record is inserted if it does not already exist in the Dim table, and the second is applied as an update against the record inserted just before it?

Not sure whether what I am trying to explain makes sense. I will attach a screenshot once I'm back at my work desk (on Monday).

Thanks

A: 

The Lookup transformation is not good for this - with caching and everything, it's simply not able to look up values it has itself set earlier in the same data flow.

You might be better off passing each row to an OLE DB Command and having a stored proc do an insert or an expire/insert depending on what it finds.
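
A minimal sketch of such a proc, with illustrative names (dbo.DimExample and its columns are assumptions based on the tables above; NULL handling is omitted). Because the proc runs once per row, in order, the second occurrence of an NK sees the row inserted by the first:

    CREATE PROCEDURE dbo.UpsertDimExample
        @NK            varchar(20),
        @Code01        varchar(10),
        @Code02        varchar(10),
        @EffectiveDate datetime
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Expire the current row only if an attribute actually changed
        -- (assumes non-NULL attribute values).
        UPDATE dbo.DimExample
        SET    EndDate = @EffectiveDate, CurrentFlag = 'N'
        WHERE  NK = @NK
          AND  CurrentFlag = 'Y'
          AND  ([01] <> @Code01 OR [02] <> @Code02);

        -- Insert a new current row if none exists (new NK or just expired).
        IF NOT EXISTS (SELECT 1 FROM dbo.DimExample
                       WHERE NK = @NK AND CurrentFlag = 'Y')
            INSERT dbo.DimExample
                   (NK, [01], [02], EffectiveDate, EndDate, CurrentFlag)
            VALUES (@NK, @Code01, @Code02, @EffectiveDate, '9999-12-31', 'Y');
    END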

You could also stream the rows to a staging table and do the work in a batch.
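
A sketch of that batch variant, again with assumed names. Note that this applies only the latest staged version of each NK as the new current row; if intermediate versions matter, you would loop over the batch in version order instead:

    -- Pick the latest staged version per NK.
    SELECT NK, [01], [02], EffectiveDate,
           ROW_NUMBER() OVER (PARTITION BY NK
                              ORDER BY EffectiveDate DESC) AS rn
    INTO   #Latest
    FROM   dbo.StageDimExample;   -- hypothetical staging table

    -- Expire current rows whose attributes changed.
    UPDATE d
    SET    d.EndDate = s.EffectiveDate, d.CurrentFlag = 'N'
    FROM   dbo.DimExample d
    JOIN   #Latest s ON s.NK = d.NK AND s.rn = 1
    WHERE  d.CurrentFlag = 'Y'
      AND  (d.[01] <> s.[01] OR d.[02] <> s.[02]);

    -- Insert new current rows where none remains.
    INSERT dbo.DimExample (NK, [01], [02], EffectiveDate, EndDate, CurrentFlag)
    SELECT s.NK, s.[01], s.[02], s.EffectiveDate, '9999-12-31', 'Y'
    FROM   #Latest s
    WHERE  s.rn = 1
      AND  NOT EXISTS (SELECT 1 FROM dbo.DimExample d
                       WHERE d.NK = s.NK AND d.CurrentFlag = 'Y');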

To address your flow and the model it's trying to populate:

To start with, it's always awkward when the order of rows in the input causes behavioral differences - i.e. NK = A, Val = 1 then NK = A, Val = 2 gives different behavior than NK = A, Val = 2 then NK = A, Val = 1. One has to wonder if this is the correct dimensional design. Remember that dimensional attributes are assigned to dimension tables based on pragmatic choices; ultimately, attributes can be arranged into tables at will, so a different design might make more sense. If things are changing within a single load, that may indicate you need to break up that load to match the grain (not attempting to load two days' data at one time).

I notice that there is an Effective Date and an End Date in your dimension. Right now this sounds a lot like a property of the dimension's behavior (where your 01 and 02 codes are changing on a NK) and not of the facts that this dimension is going to be attached to. This might indicate that it needs to be tracked in a separate factless fact table, say (SK, EffectiveDate, EndDate) - or that it's just not important, because all you care about is a NK, 01, 02 combination attached to a fact, in which case your natural key is really all of NK, 01, 02.
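
A sketch of the kind of factless fact table that describes (names are illustrative):

    -- Tracks code-assignment validity periods separately from the dimension.
    CREATE TABLE dbo.FactCodeAssignment (
        SK            int      NOT NULL,  -- surrogate key of the dimension row
        EffectiveDate datetime NOT NULL,
        EndDate       datetime NOT NULL,
        CONSTRAINT PK_FactCodeAssignment PRIMARY KEY (SK, EffectiveDate)
    );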

I recommend going back to your fact table, the incoming feed, and the expected usage, looking at them more closely, and seeing whether this needs to be a separate factless fact table tracking these dimension changes.

Also, if you could post more details, that might help, and I'll see what Kimball's materials have to say about it when I see more of a business case.

Cade Roux
Thanks Cade, that makes sense. Suppose my approach is to join the two data sources, pivot, and then pass the rows through a Script Component that works out the target records by pulling the other attribute values for the new incoming attributes, then dump the result into a staging table and merge it with the actual dim. What would be the performance hit of using a Script Component in this scenario?
Nev_Rahd
@Nev_Rahd Anything done on a row-by-row basis is going to be slower against the DB. I think you really need to look at your model. I've added some more material. The model should match the feed, to the extent that how the data is modeled should reflect how the data behaves in terms of sub-codes "wiggling" over time. I think this is a modeling issue more than an ETL issue, because I don't think that when you query the fact tables related to this dimension you are going to be using Effective Date and End Date - they relate to the code assignments and not to the (current) facts.
Cade Roux
@Nev_Rahd The facts in the feed that are going to be assigned to these dimensions are physically tied to them by the SK, so the effective date isn't going to matter when querying the DB.
Cade Roux
A: 

Cade's comments are spot on - but I believe your main issue is the duplicates, period. Does the fact that you have two versions of the same NK in the "source" stream indicate two separate, meaningful versions? Or does only the "last" version matter?

If the changes reflected in both versions should be reflected in your dimension table, then I echo Cade's suggestion to divide your processing into batches. You could sort your input by the NK (and time of change), then use a row-counting script to enumerate the "versions" of each NK, and then process "batches" by version number.
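
One way to do that numbering at the source query rather than in a script (a sketch; the table name is made up):

    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Key1, Key2
                              ORDER BY EffectiveDate) AS VersionNo
    FROM   dbo.EDW_KeyValue;
    -- Then run the dimension load once per VersionNo: all the 1s,
    -- then all the 2s, and so on.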

If only the last "version" needs to be incorporated into the dimension table, I suggest you eliminate the duplicates prior to using a Lookup (e.g. with the same ROW_NUMBER pattern sketched above, ordered descending and keeping only row 1 per NK).

Todd McDermid