views:

1093

answers:

2

I have a million rows in a database table. For each row I have to run a custom exe, parse the output and update another database table

How can I run process multiple rows in parallel?

I now have a simple dataflow task ->GetData->Run Script (Run Process , Parse Output)->Store Data For 6000 rows it took 3 hours.Way too much.

+1  A: 

If you are executing this process using the "data flow" container, then there is a property on it called "EngineThreads" which defaults to a value of 5. You can set it to a higher number like 20, which will devote more threads to processing those rows.

That is just a performance tweak or optmisation, if your ssis package is still running really slowly then I would perhaps address the architecture and design of your package.

Jobo
+2  A: 

There is the single bottleneck here, running the process per each row. Increasing "EngineThreads" would not help at all, as there will be only one thread running this particular script transform anyway. The time spent in other transforms probably does not matter at all. Processes are heavy weight objects, and running thousands of them will never be cheap.

I can think of following ideas to make it better:

1) The best way to fix it is to convert your custom EXE into an assembly and call it from the script transform - to avoid the overhead of creating processes, parsing the output etc.

2) If you have to use the separate processes, you can try to run these processes in parallel. It will help if the process mostly waits for some input/output (i.e. it is I/O bound). If the processes are memory bound or CPU bound, you would not win much by running them in parallel.

2A) Complex script, simple package.

To run them in parallel, modify the ProcessInput method in your script to start the process asynchronously, and don't wait for the process completion - move to the next row and create the next process. Subscribe to process output and process Exited event, so you know when it has finished. Limit the number of processes run in parallel - otherwise you'll run out of memory. Wait until all the processes are done before returning from ProcessInput call.

2B) Simple script, complex package.

Keep the current sequential script, but partition the data using SSIS. Add conditional split transform, and split the input stream into multiple streams, based on some hash expression - something that will make each output to receive approximately the same amount of data. The number of streams equals the number of process instances you want to run in parallel. Add your script transform to each output of conditional split. Now you should also increase "Engine Threads" property :) and these transforms will run in parallel. (Note: based on tag, I assume you use SSIS 2008. You'll need to insert additional Union All transforms to make it work in SSIS 2005).

This should make it perform better, but millions of processes is a lot. You'll hardly get really good performance here.

Michael

related questions