I've had similar problems over the past few weeks, and here are several things you could consider, listed in decreasing order of importance according to what made the biggest difference for us:
Don't assume anything about the server.
We found that our production server's RAID was misconfigured (HP sold us disks with firmware mismatches) and the disk write speed was literally a 50th of what it should have been. So check the server metrics with Perfmon.
Check that enough RAM is allocated to SQL Server. Inserts of large datasets often require RAM and TempDB for building indices, etc. Ensure that SQL Server has enough RAM that it doesn't need to swap out to Pagefile.sys.
As per the holy grail of SSIS, avoid manipulating large datasets using T-SQL statements. T-SQL statements cause changed data to be written to the transaction log even if you use the Simple recovery model. The main difference between the Simple and Full recovery models here is that Simple truncates the log automatically once transactions have committed and a checkpoint has run. This means that large datasets, when manipulated with T-SQL, thrash the log file, killing performance.
For large datasets, do data sorts at the source if possible. The SSIS Sort component chokes on reasonably large datasets, and the only viable alternative (nSort by Ordinal, Inc.) costs $900 for a non-transferable per-CPU license. So... if you absolutely have to sort a large dataset, then consider loading it into a staging database as an intermediate step.
Use the SQL Server Destination if you know your package is going to run on the destination server, since it offers a roughly 15% performance increase over OLE DB because it shares memory with SQL Server.
Increase the network packet size to 32767 on your database connection managers. This allows large volumes of data to move faster from the source server(s), and can noticeably improve reads on large datasets.
If using Lookup transforms, experiment with cache modes: a Cache connection or Full Cache mode for smaller lookup datasets, and Partial Cache or No Cache for larger datasets. This can free up much-needed RAM.
If combining multiple large datasets, use either RAW files or a staging database to hold your transformed datasets, then combine and insert all of a table's data in a single data flow operation, and lock the destination table. Using staging tables or RAW files can also help relieve table locking contention.
Last but not least, experiment with the DefaultBufferSize and DefaultBufferMaxRows properties. You'll need to monitor your package's "Buffers Spooled" performance counter using Perfmon.exe, and adjust the buffer sizes upwards until you see buffers being spooled (paged to disk), then back off a little.
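As a rough starting point for that tuning, you can estimate how many rows fit in a buffer before touching the properties. This is only a back-of-the-envelope sketch: it assumes the stock defaults (DefaultBufferSize of 10 MB, DefaultBufferMaxRows of 10,000) and that the row count per buffer is capped by whichever limit is hit first. The function name and the estimated row width are my own illustration, not anything SSIS exposes, and the engine's real sizing logic has more to it than this.

```python
# Assumed stock defaults for the two data flow properties.
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024   # bytes (10 MB)
DEFAULT_BUFFER_MAX_ROWS = 10_000


def rows_per_buffer(est_row_bytes,
                    buffer_size=DEFAULT_BUFFER_SIZE,
                    max_rows=DEFAULT_BUFFER_MAX_ROWS):
    """Approximate rows per buffer: the smaller of the row cap and
    the number of whole rows that fit in the buffer."""
    if est_row_bytes <= 0:
        raise ValueError("est_row_bytes must be positive")
    return min(max_rows, buffer_size // est_row_bytes)


# Narrow rows hit the row cap; wide rows hit the size cap.
narrow = rows_per_buffer(500)
wide = rows_per_buffer(2000)
print(narrow, wide)
```

If the row cap is the limit (narrow rows), raising DefaultBufferMaxRows is the lever; if the size cap is the limit (wide rows), raising DefaultBufferSize is. Either way, keep watching "Buffers Spooled" as described above.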
Point 8 (loading all of a table's data in a single data flow with the destination table locked) is especially important on very large datasets, since you can only achieve a minimally logged bulk insert operation if:
- The destination table is empty,
- The table is locked for the duration of the load operation, and
- The database is in Simple or Bulk Logged recovery mode.
This means that subsequent bulk loads into a table will always be fully logged, so you want to get as much data as possible into the table on the first data load.
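To make the three conditions above easy to check before kicking off a big load, here is a small sketch that encodes them as a pre-flight check. It only captures the rules as listed in this answer, not SQL Server's full set of minimal-logging prerequisites, and the `LoadContext` class and its fields are hypothetical names; you would have to populate them from your own metadata queries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LoadContext:
    """Hypothetical description of the planned load."""
    table_is_empty: bool
    table_lock_held: bool
    recovery_model: str  # e.g. "SIMPLE", "BULK_LOGGED" or "FULL"


def unmet_conditions(ctx: LoadContext) -> List[str]:
    """Return the conditions from the list above that are not satisfied."""
    problems = []
    if not ctx.table_is_empty:
        problems.append("destination table is not empty")
    if not ctx.table_lock_held:
        problems.append("table is not locked for the load")
    if ctx.recovery_model.upper() not in ("SIMPLE", "BULK_LOGGED"):
        problems.append("database is not in Simple or Bulk Logged recovery")
    return problems


first_load = LoadContext(True, True, "simple")
second_load = LoadContext(False, True, "FULL")
print(unmet_conditions(first_load))
print(unmet_conditions(second_load))
```

An empty list means the load meets all three conditions; anything else tells you which one will push you back into a fully logged insert.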
Finally, if you can partition your destination table and then load the data into each partition in parallel, you can achieve up to 2.5 times faster load times, though this isn't usually a feasible option out in the wild.
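For what it's worth, the shape of a parallel per-partition load looks roughly like the sketch below. Everything in it is illustrative: `load_partition` is a hypothetical stand-in that just counts rows where a real bulk load would go, the boundary values are made up, and partitions are numbered from 0 here. Rows are routed so that a boundary value falls into the partition on its right, which is an assumption about how your partition function is defined.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def split_by_partition(rows, boundaries, key):
    """Group rows by partition index; boundaries must be sorted.
    A key equal to a boundary goes to the partition on the right."""
    groups = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        groups[bisect_right(boundaries, key(row))].append(row)
    return groups


def load_partition(index, rows):
    """Hypothetical stand-in for one partition's bulk load.
    Returns the partition index and the number of rows 'loaded'."""
    return index, len(rows)


def load_in_parallel(rows, boundaries, key, max_workers=4):
    """Load each partition's rows on its own worker thread."""
    groups = split_by_partition(rows, boundaries, key)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(load_partition, range(len(groups)), groups)
        return dict(results)


rows = [{"id": i} for i in (5, 100, 150, 200, 999)]
print(load_in_parallel(rows, [100, 200], key=lambda r: r["id"]))
```

In an SSIS package the equivalent is usually one data flow (or one package execution) per partition range, run concurrently, rather than threads in a script.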