I am trying to improve the performance of an SSIS Package.

One thing I started with is filtering the reference table of the Lookups. Until now, I was using a full table as the reference table for that lookup.

The first improvement was to replace the table with a SQL query that selects just the columns I need from that table.
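For illustration, a reference query along these lines (the table and column names here are hypothetical) keeps everything except the join key and the returned column out of the lookup cache:

    -- Only the key and the column actually returned by the lookup,
    -- instead of SELECT * over the whole reference table.
    SELECT DocumentId, DocumentName
    FROM dbo.Documents;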

Next, I want to load into this reference only the records I know I'll actually use. If I leave it as it is, I end up loading 300,000 rows or more (huge rows with binary content of around 500 KB each) and using only around 100 of them.

I would add a filter to the SQL query that defines the lookup's reference table, BUT in that filter I need to use ALL the IDs of the rows loaded by my OLE DB source.

Is there any way to do this?

I thought of loading one row at a time using an OLE DB Command instead of a Lookup, but besides being time-consuming, I might end up loading the same thing 100 times for 100 different rows, when I could load it once in the lookup and use it 100 times...

Enabling the cache would be another option, but that doesn't sound very good either, because it would slow us down - and we are already terribly slow.

Any ideas are greatly appreciated.

A: 

One possibility is to first stream the distinct IDs to a permanent/temporary table in one data flow, and then use that table in your lookup's reference query (with a join) in a later data flow (you will probably have to delay validation).
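As a rough sketch, assuming the first data flow lands the distinct source IDs in a staging table called dbo.LookupKeys (all object names here are placeholders), the lookup's reference query in the second data flow could be restricted like this:

    -- Data flow 1 writes the distinct source IDs into dbo.LookupKeys.
    -- Data flow 2 uses this as the lookup's reference query, so only the
    -- handful of reference rows that will actually be matched get loaded.
    SELECT d.DocumentId, d.DocumentContent
    FROM dbo.Documents AS d
    JOIN dbo.LookupKeys AS k
        ON k.SourceId = d.DocumentId;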

In many of our ETL packages, we first stream the data into a Raw file, handling all the type conversions and everything else on the way there. Then, once all those conversions have succeeded, we handle creating new dimensions and then the facts that link to those dimensions.

Cade Roux
Ok Cade, thanks for your suggestion, I think it might work all right. What do you think about the caching option? Have you used it? Do you think it might help me improve things, or will it just slow me down?
Diana
@Diana We are only using 2005 in production and we also have case-sensitivity issues, so we use lookups only in the fully pre-loaded mode. I don't remember what that's called - the nomenclature in the UI is misleading as to the actual behavior. I guess my point about the Raw file was that you often have to do a lot of out-of-the-box thinking about the process in order to get performance where you want it. In our case, doing the conversions once and saving the results was far quicker than doing the conversions to check the data was good and then doing the conversions again on load.
Cade Roux
@Diana Even though we were streaming out multi-GB Raw files and then reading them back in, it was far faster overall than reading from the source again and converting again.
Cade Roux