Need help big time ....

I need to create a .NET application that will perform bulk operations on, say, around 2,000,000 records in a table. There is a window of opportunity in which the application should run and process as many rows as it can. I am thinking that if I have multiple threads of the app take 2,000 records at a time and process them, it should be able to process more. However, that will be pretty costly on the database server. From what I am told, the DB server is a beefy machine and should be able to handle the pressure.

Also, by taking only 2,000 rows at a time, should the application terminate in the middle of its processing, it would know where to pick up again.

So, I guess what I am asking is ...

1) How can I have the app pick up rows and lock them so that they are not assigned to another thread?

2) What kind of intelligence can be programmed into the app so that it can pick up processing where it last left off?

Thanks

KP

+1  A: 

Rather than reinvent the wheel, you should do this using SQL Server Integration Services (SSIS). It is highly optimized for these scenarios, especially in the 2008 release.

John Saunders
John, this is the direction I initially wanted to go. But the DBAs at this company are afraid of SSIS. They can't provide a solid reason why they DO NOT want to use it.
Well, then they're fools. SSIS is optimized for these scenarios much better than anything you or your DBAs are likely to build. It's one of its jobs, so to speak. They may be thinking this is still the old DTS, in which case they should pay closer attention to progress when it happens.
John Saunders
+1  A: 

I agree with John that SSIS has a lot of built-in intelligence for such scenarios and is probably the best bet to invest your time in.

For the record, this kind of problem is approached by partitioning your data. I'm not talking about physical storage partitioning (i.e. table partitioning), but logical, processing partitioning: you split your 2 million records into N partitions based on whatever criteria you have that can be exploited at the data access level, e.g. an indexed column, then allocate N processors that each churn through their own partition (see the sketch below). The idea is that the processors never overlap in trying to access the same rows. 'Processors' can be threads, or better still ThreadPool queued-up worker items that use async database access methods.
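For illustration, here is a minimal sketch of one such logical partition, assuming a hypothetical indexed integer key Id on myTable; each worker is handed its own [@lo, @hi) range, so no two workers ever touch the same rows:

declare @lo int = 0;       -- this worker's range start (assumed values)
declare @hi int = 500000;  -- this worker's range end

select *
from myTable
where Id >= @lo and Id < @hi;  -- an index seek keeps each slice cheap

Each worker then processes only the rows its query returned and moves on to its next assigned range.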

The big problem is that many times you don't have a suitable partitioning key. In such cases you can do ad-hoc partitioning like this:

-- grab the next batch; updlock reserves these rows for the current
-- processor, readpast skips rows already claimed by other processors
with cte as (
   select top (@batchSize) *
   from myTable with (rowlock, updlock, readpast)
   where <record is ready to be processed>)
update cte
   set <mark record processing>
output inserted.*;  -- hand the claimed rows back to the caller

The trick is the locking hints used in the select: updlock reserves the selected records for the current processor, and readpast makes each processor skip records that are already locked by other processors. This way each processor gets its own @batchSize batch of records to process, whatever the processing is.
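This same pattern also gives you the restart intelligence asked about in question 2: if <mark record processing> sets a status column, finished rows get a final status, and a restarted run can re-queue anything a crashed run left in flight. A minimal sketch, assuming hypothetical Status (0 = pending, 1 = in progress, 2 = done) and Id columns:

-- after the app finishes a row, mark it done so a restarted run
-- will not process it again
update myTable
   set Status = 2
where Id = @processedId;

-- on startup, re-queue rows a crashed run left in progress
update myTable
   set Status = 0
where Status = 1;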

It is important to understand that all of this applies only to processing that involves something outside the database, like making a web service call, printing a paper slip, or anything similar. If the processing is all in the database, then you should just express it as a single T-SQL update and let the query optimizer parallelize it as it sees fit.
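For example, if the work were a simple computed update, a hypothetical set-based version (column names invented for illustration) would just be:

-- one set-based statement; SQL Server parallelizes the scan and
-- update on its own, no application-side batching needed
update myTable
   set Total = Price * Quantity
where Processed = 0;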

Remus Rusanu