views:

193

answers:

2

I have a table of customers. This table can be big, millions of rows. Every hour, I have a console app that goes through all the customers and updates another table with customer changes that occured in the last hour.

What I want to do is two things: (1) Have the console app (or SSIS package) be multi-threaded so that I can have a couple of these running so that the update finishes quicker. It needs to be smart and be able to select customers in a way so that two threads never execute code for the same customer.

(2) If a new customer signs up, he should not have to wait an hour or until the console app gets to him to get data. There should be another process/app that has a queue of new customers and runs analysis on that. The problem is that I want to prevent this queue from running code on a customer if the hourly process is also running code on that process.

From what I've read on previous threads on here, SSIS is what I want to use. Is this correct? Is SSIS the right tool?

I've been looking into SSIS, and am not sure where to start and how to do the multi-threaded approach for this. I'd love to have some rough pointers so that I have a better idea of where to start and what to research.

Thanks!

+2  A: 

SSIS is the right tool. The right way to go about it is to create your Data Flow task and don't worry about multithreading. SSIS will take care of that for you, especially SSIS 2008.

The extent to which you have to worry is that you will want to be aware that certain data flow transforms require all the input rows to be processed before they produce any output at all. A Sort transform, for instance, has to wait until all the rows have been passed to it, do the sort, and then start passing out sorted rows. If you avoid such transforms, then SSIS will be very happy to optimize the number of threads being used.


On the priority issue, I haven't thought this through, but how about having the batch job ignore new customers. Have another SSIS package the runs more frequently, to do the processing on the new rows, then flag that they've been processed for the first time. That way, there can be no conflict.

John Saunders
A: 

I'm not sure SSIS is the way to go. You can do it using pure T-SQL.

Essentially you want something to traverse rows in a non-blocking yet exclusive way, with a separate process to detect new rows. Each row "locked" requires further processing.

Use UPDLOCK, ROWLOCK, READPAST per console app to manage row selection: "Processing Data Queues in SQL Server with READPAST and UPDLOCK".

Each process now owns a row (non-blocking yet exclusive) and it can be worked on.

The transfer of data to another table could be done in the same code, eg stored proc.

You could even use a WHILE loop inside to just carry on processing. Set it running, that's it. If it finds no rows, you could build in a WAITFOR.

Personally, I wouldn't use SSIS.

gbn
Please say _why_ you would not use SSIS. In particular, how do your techniques address his concerns about performance?
John Saunders
Easy. SSIS is cumbersome for this kind of task and you'll still need SQL to get data in a safe concurrent manner. So why not do it in SQL instead of offloading to another tool? Given it's a INSERT.. SELECT on a some exclusive rows, isn't this what SQL is designed for? What can SSIS do that raw T-SQL can not?
gbn