I'm currently working on a project with a large data warehouse that imports several GB of data daily from a number of different sources. We have a lot of files with different formats and structures, all being imported into a couple of base tables which we then transpose/pivot through stored procs. That part works fine. The initial import, however, is awfully slow.

We can't use SSIS File Connection Managers as the columns can be totally different from file to file, so we have a custom object model in C# that transposes rows and columns of data into two base tables: one for column names, and another for the actual data in each cell, with each cell record related to a record in the attribute table.

Example - Data Files: (image not shown)

Example - DB tables: (image not shown)

The SQL insert is currently performed by looping through all the data rows and appending the values to a SQL string. This builds one large dynamic string, which is then executed at the end via SqlCommand.
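
In rough terms, the current pattern looks something like this (a sketch only; the table and column names here are illustrative, not our real schema):

    using System.Data;
    using System.Data.SqlClient;
    using System.Text;

    // Sketch of the current approach: build one big batch of INSERT statements
    // and execute it as a single SqlCommand. Table/column names are illustrative.
    static void InsertByDynamicString(DataTable parsedRows, string connectionString)
    {
        var sql = new StringBuilder();
        foreach (DataRow row in parsedRows.Rows)
        {
            sql.AppendFormat(
                "INSERT INTO CellData (FileId, RowId, AttributeId, Value) VALUES ({0}, {1}, {2}, N'{3}');",
                row["FileId"], row["RowId"], row["AttributeId"],
                row["Value"].ToString().Replace("'", "''"));
        }

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql.ToString(), connection))
        {
            connection.Open();
            command.ExecuteNonQuery(); // the server has to parse every statement in the batch
        }
    }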

The problem is that even a 1MB file takes about a minute to import, so large files (200MB and up) take hours to process. I'm looking for suggestions on other ways to approach the insert that will improve performance and speed up the process.

There are a few things I can do with the structure of the loop to cut down on the string size and the number of SQL commands in it, but ideally I'm looking for a cleaner, more robust approach. Apologies if I haven't explained myself well; I'll provide more detail if required.

Any ideas on how to speed up this process?

+1  A: 

One thought - are you repeatedly going back to the database to find the appropriate attribute value? If so, switching the repeated queries to lookups against a recordset you keep on the client side will speed things up enormously.

This is something I have done before, with four reference tables involved. Creating a local recordset and filtering it as appropriate cut a process from 2.5 hours to about 3 minutes.
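
For example, a minimal sketch of that idea, assuming an Attribute reference table with AttributeId and AttributeName columns (the names are illustrative, not your actual schema):

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;

    // Load the attribute reference data once, then resolve IDs from a local
    // dictionary instead of querying the database for every cell.
    // Table/column names (Attribute, AttributeId, AttributeName) are assumptions.
    static Dictionary<string, int> LoadAttributeIds(string connectionString)
    {
        var attributeIds = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT AttributeName, AttributeId FROM Attribute", connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    attributeIds[reader.GetString(0)] = reader.GetInt32(1);
                }
            }
        }

        return attributeIds;
    }

    // Later, inside the import loop:
    // int attributeId = attributeIds[columnName];  // no round trip per cell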

Thomas Rushton
Hi Thomas. Yes, there are a number of repeated queries that execute each time to find attribute IDs etc. Loading them up beforehand should definitely help. I think there's still more we can do, though; I'm wondering if there's a better way of actually performing the insert of the data rather than building a dynamic string?
fat_tony
I doubt it, to be honest, without going down the route of writing out to a text file and then calling the bulk import (filestream) routines. And they'll make for a system with additional failure points.
Thomas Rushton
Yeah, that makes sense actually. I'm hoping to try and make a few changes this afternoon so I'll let you know how this all goes. Thanks
fat_tony
Good luck! Let us know how you get on.
Thomas Rushton
I'm still working on getting everything up and running but I've cut the number of DB calls right down and am continuing to reorganise code to remove more. It definitely looks like it's making a big difference so far though. Thanks for the help!
fat_tony
Excellent news. You're welcome.
Thomas Rushton
+1  A: 

The dynamic string is going to be SLOW. Each SqlCommand is a separate call to the database. You are much better off streaming the output as a bulk insertion operation.
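
For instance, from your existing C# object model, something along the lines of SqlBulkCopy would stream all the unpivoted rows in one go (a sketch only; the destination table and column names are assumptions):

    using System.Data;
    using System.Data.SqlClient;

    // Stream the already-unpivoted rows into the cell table in one bulk operation
    // instead of one INSERT per cell. Table/column names are assumptions.
    static void BulkInsertCells(DataTable cells, string connectionString)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.CellData";
            bulkCopy.BatchSize = 10000;    // commit in chunks rather than one huge transaction
            bulkCopy.BulkCopyTimeout = 0;  // no command timeout for large files

            bulkCopy.ColumnMappings.Add("FileId", "FileId");
            bulkCopy.ColumnMappings.Add("RowId", "RowId");
            bulkCopy.ColumnMappings.Add("AttributeId", "AttributeId");
            bulkCopy.ColumnMappings.Add("Value", "Value");

            bulkCopy.WriteToServer(cells);
        }
    }

But if you are in SSIS anyway, the data flow approach below achieves the same thing without building an intermediate DataTable.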

I understand that all your files are in different formats, so you are having to parse and unpivot in code to get the data into your EAV database form.

However, because the output is in a consistent schema, you would be better off either using separate connection managers with the built-in Unpivot transformation, or using a script source to add multiple rows to the data flow for the common output (just as you currently build your SQL INSERT...INSERT...INSERT for each input row), and then letting it all stream into a single destination.

That is, read your data in the script source and assign the FileID, RowId, AttributeName and Value to multiple output rows (so the unpivot still happens in code, but instead of generating a varying number of INSERT statements, you are just adding a varying number of rows to the data flow for each input row).
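
Inside the script source component that looks roughly like this (the output column names and the file-parsing helpers are assumptions; Output0Buffer is just the default buffer name SSIS generates for the first output):

    // Inside the script source component's generated ScriptMain class.
    // One data-flow row is added per attribute/value pair in each input line.
    // FileID/RowId/AttributeName/Value must match the output columns you define;
    // ReadSourceFile() stands in for whatever parsing your object model already does.
    public override void CreateNewOutputRows()
    {
        foreach (var line in ReadSourceFile())
        {
            foreach (var cell in line.Cells)
            {
                Output0Buffer.AddRow();
                Output0Buffer.FileID = line.FileId;
                Output0Buffer.RowId = line.RowId;
                Output0Buffer.AttributeName = cell.AttributeName;
                Output0Buffer.Value = cell.Value;
            }
        }
    }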

Then pass that through a Lookup transformation to map AttributeName to AttributeID (sending rows with invalid attributes to the error output).

Stream straight into an OLE DB destination, and it should be a lot quicker.

Cade Roux
I've changed the accepted answer to this one, as I've finally had a chance to take a proper look at optimising this SSIS package and this approach has proved by far the best. The other answer was helpful and did improve things, so I've voted it up, but this one gave a 99% speed improvement. As such, it only seemed fair to mark this one as the correct answer.
fat_tony
@fat_tony For the benefit of later readers, and if you have time, could you please update your question with more detail about the structure of the final solution you implemented? My answer didn't spell out an exact solution.
Cade Roux
@Cade Roux - Good suggestion. I'll update my question when I get a chance and give some more detail on the solution. Cheers for the help, sorry it's taken so long to get round to responding!
fat_tony