There is a project in flight at my organization to move customer data and all the associated records (billing transactions, etc.) from one database to another if the customer has not had account activity within a certain timeframe.

The total number of rows across all the tables is in the millions, perhaps 100 million rows with the various tables combined. The schema is more or less normalized. The project's designers have decided on SSIS to execute this, and initial analysis is showing 5 months of execution time.

Basically, the process:

  1. Fills an "archive" database that has the same schema as the source database
  2. Deletes the original rows from the source database
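
In rough T-SQL terms, the per-customer logic amounts to something like the following (illustrative only; the table and column names here are placeholders, not our actual schema):

    -- Illustrative sketch only: table and column names are placeholders.
    BEGIN TRANSACTION;

    -- Step 1: copy the inactive customer's rows into the archive database
    INSERT INTO ArchiveDB.dbo.BillingTransactions (CustomerId, TxnDate, Amount)
    SELECT CustomerId, TxnDate, Amount
    FROM   SourceDB.dbo.BillingTransactions
    WHERE  CustomerId = @CustomerId;

    -- Step 2: delete the originals from the source database
    DELETE FROM SourceDB.dbo.BillingTransactions
    WHERE  CustomerId = @CustomerId;

    COMMIT TRANSACTION;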

I can provide more detail if necessary. What I'm wondering is, is SSIS the correct approach? Is there some sort of canonical way to move very large quantities of data around? Are there common performance pitfalls to avoid?

I just can't believe that this is going to take months to run and I'd like to know if there's something else that we should be looking into.

+1  A: 

SSIS is probably not the right choice if you are simply deleting records.

This might be of interest: Performing fast SQL Server delete operations

UPDATE: as Remus correctly points out, SSIS can perform well or badly depending on how the flows are written, and there have been some huge benchmarks (on high-end systems). But for just deletes there are simpler ways, such as a SQL Agent job running a T-SQL delete in batches, as sketched below.
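
A minimal sketch of that batched-delete approach (the batch size, table names, and the inactivity predicate are assumptions to tune for your own schema):

    -- Delete in small batches so each transaction stays short and the
    -- transaction log can be truncated or backed up between batches.
    -- The batch size (10000 here) is an assumption; tune it for your system.
    DECLARE @rows INT = 1;

    WHILE @rows > 0
    BEGIN
        DELETE TOP (10000)
        FROM   dbo.BillingTransactions
        WHERE  CustomerId IN (SELECT CustomerId
                              FROM   dbo.Customers
                              WHERE  LastActivityDate < DATEADD(MONTH, -24, GETDATE()));

        SET @rows = @@ROWCOUNT;
    END

Each iteration commits on its own, so the job can be stopped and restarted without losing progress.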

Mitch Wheat
+2  A: 

SSIS is just a tool. You can write a 100M-row transfer in SSIS that takes 24 hours, or you can write one that takes 5 months. The problem is what you write (i.e. the workflow, in the SSIS case), not SSIS.

There isn't anything specific to SSIS that would dictate 'the transfer cannot be done faster than 5 months'.

The guiding principles for such a task (logically partition the data, process each logical partition in parallel, eliminate access and update contention between processing streams, commit changes in batches, don't transfer more data over the wire than is necessary, use set-based processing as much as possible, be able to suspend and resume, etc.) can be implemented in SSIS just as well as in any other technology, if not better.
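
To make a few of those principles concrete (key-range batching, batched commits, suspend/resume via a checkpoint table), here is one possible T-SQL sketch; all table, column, and variable names are invented for illustration, and the inactivity filter is omitted for brevity:

    -- Resumable, key-range-batched move. A one-row checkpoint table
    -- records the last key processed, so the job can be suspended and
    -- resumed; disjoint key ranges could also be processed in parallel.
    DECLARE @lastId BIGINT, @maxId BIGINT, @batch INT = 50000;

    SELECT @lastId = ISNULL(LastProcessedId, 0) FROM dbo.ArchiveCheckpoint;
    SELECT @maxId  = MAX(Id) FROM SourceDB.dbo.BillingTransactions;

    WHILE @lastId < @maxId
    BEGIN
        BEGIN TRANSACTION;

        -- Copy one key range (set-based, no row-by-row processing)
        INSERT INTO ArchiveDB.dbo.BillingTransactions (Id, CustomerId, TxnDate, Amount)
        SELECT Id, CustomerId, TxnDate, Amount
        FROM   SourceDB.dbo.BillingTransactions
        WHERE  Id > @lastId AND Id <= @lastId + @batch;

        -- Delete the same range from the source
        DELETE FROM SourceDB.dbo.BillingTransactions
        WHERE  Id > @lastId AND Id <= @lastId + @batch;

        -- Record progress so a restart picks up where it left off
        UPDATE dbo.ArchiveCheckpoint SET LastProcessedId = @lastId + @batch;

        COMMIT TRANSACTION;

        SET @lastId = @lastId + @batch;
    END

Each range commits independently, which keeps locks and log growth bounded and makes the 'suspend and resume' property nearly free.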

For the record, the ETL world speed record stands at about 2 TB per hour. Using SSIS. And just as a matter of fact, I just finished a transfer of 130M rows, ~200 GB of data, in about 24 hours (I'm lazy and wasn't shooting for the ETL record).

I would understand 5 months for development, testing, and deployment, but not 5 months of actual processing. That works out to roughly 7 rows a second, which is really, really lame.

Remus Rusanu
I'd upvote you a million times if I could. Boy, I'd love to see the package that moved 2 terabytes an hour.
HLGEM
@HLGEM that's what she said
Chris McCall