Let's say I've got a SQL Server database table with X (> 1,000,000) records in it that need to be processed (get data, perform an external action, update status in the db) one-by-one by worker processes (console apps, Windows services, Azure worker roles, etc.). I need to guarantee each row is only processed once. Ideally, exclusivity would be guaranteed no matter how many machines/processes were spun up to process the messages. I'm mostly worried about two SELECTs grabbing the same rows simultaneously.

I know there are better datastores for queuing out there, but I don't have that luxury for this project. I have ideas for accomplishing this, but I'm looking for more.

+7  A: 

I've had this situation.

Add an InProcess column to the table, default = 0.
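
A minimal sketch of the setup, assuming the table is named tbl with an integer key rowID, matching the queries below:

ALTER TABLE tbl ADD InProcess int NOT NULL DEFAULT 0

In the consumer process: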

UPDATE tbl SET InProcess = @myMachineID WHERE rowID = 
    (SELECT MIN(rowID) FROM tbl WHERE InProcess = 0)

Now that the machine owns the row, you can query its data without fear. Usually your next line will be something like this:

SELECT * FROM tbl WHERE rowID = 
    (SELECT MAX(rowID) FROM tbl WHERE InProcess = @myMachineID)

You'll also have to add a Done flag of some kind to the row, so you can tell if the row was claimed but processing was incomplete.
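
A hedged sketch of that completion step (the Done column name and @claimedRowID variable are illustrative, not from the original):

ALTER TABLE tbl ADD Done bit NOT NULL DEFAULT 0

-- after the external action succeeds:
UPDATE tbl SET Done = 1 WHERE rowID = @claimedRowID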

Edit

The UPDATE gets an exclusive lock (see MSDN). I'm not sure whether the SELECT in the subquery can be split from the UPDATE; if it can, you'd have to put them in a transaction.
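
If they can be split, a sketch of that two-statement form (the UPDLOCK and HOLDLOCK hints are my assumption, added so the chosen row stays locked until the commit):

BEGIN TRANSACTION

DECLARE @rowID int
SELECT @rowID = MIN(rowID) FROM tbl WITH (UPDLOCK, HOLDLOCK) WHERE InProcess = 0

UPDATE tbl SET InProcess = @myMachineID WHERE rowID = @rowID

COMMIT TRANSACTION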

@Will A posts a link which suggests that beginning your batch with this will guarantee it:

SET TRANSACTION ISOLATION LEVEL READ COMMITTED

...but I haven't tried it.

@Martin Smith's link also makes some good points, including a look at the OUTPUT clause (added in SQL Server 2005).
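
For illustration, OUTPUT lets one statement both claim a row and return its key; a sketch (the TOP (1) form and the ROWLOCK/READPAST hints are assumptions in the spirit of that link, not part of the original answer):

UPDATE TOP (1) tbl WITH (ROWLOCK, READPAST)
SET InProcess = @myMachineID
OUTPUT inserted.rowID
WHERE InProcess = 0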

One last edit

Very interesting exchange in the comments; I definitely learned a few things here. And that's what SO is for, right?

Just for color: when I used this approach back in 2004, I had a bunch of web crawlers dumping URLs-to-search into a table, then pulling their next URL-to-crawl from that same table. Since the crawlers were attempting to attract malware, they were liable to crash at any moment.

egrunin
+1 You need out-of-band cleanup for the case where the consumer app does not correctly transition between "InProcess" and "Done"
Steve Townsend
Any reason why @myMachineID can't just be @@SPID, of course assuming that both queries are executed in the same batch?
Will A
Does the UPDATE lock the selected rows while updating them or could multiple processes claim the rows simultaneously?
John Sheehan
@Steve Townsend: yes. If you add a timestamp column (`StartTime`), then the server can tell when something has died and the row should be reset.
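For illustration, a sketch of such a reset (the 30-minute threshold is made up): `UPDATE tbl SET InProcess = 0 WHERE InProcess <> 0 AND Done = 0 AND StartTime < DATEADD(minute, -30, GETDATE())`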
egrunin
@Will A: mine weren't, so I never tried it.
egrunin
@egrunin - fair enough, am sure it'd be fine if both queries were in the same batch - and batching the pair would save on a round-trip as well.
Will A
@Steve - my reading of the READ COMMITTED isolation level is that one and only one process can claim a row using the SQL above. http://msdn.microsoft.com/en-us/library/ms173763.aspx
Will A
I'm going to have to reread my SQL Server Internals. :)
Will A
@Will A - my concern is not >1 process claiming the row, it's the case where a process claims a row and then does not complete the required processing due to some unexpected condition (process crash, network error)
Steve Townsend
@Martin Smith: you're right, MSDN doesn't link the 2k version to 2k5 and 2k8, so I missed it.
egrunin
@egrunin Hmm, after messing around with Profiler looking at the lock events, I think my main objection was wrong. I thought the subquery would just take an `S` lock, which doesn't seem to be the case - sorry! Still, I'll leave the link to another approach: http://rusanu.com/2010/03/26/using-tables-as-queues/
Martin Smith
Just FYI - have played with this approach in .NET / SQL Server - if you want to guarantee that only one process can pick up a particular row, then you'll need a `WITH (TABLOCKX)` after the first occurrence of `tbl` in the `UPDATE` query.
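For illustration, a sketch of where that hint sits: `UPDATE tbl WITH (TABLOCKX) SET InProcess = @myMachineID WHERE rowID = (SELECT MIN(rowID) FROM tbl WHERE InProcess = 0)`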
Will A
A: 

I'd consider having the process fetch the top N records whose "processed" flag is zero into a local collection. I would actually have three values for the processed flag: NotProcessed (0), Processing (2), Processed (1). Then loop through your collection and issue the following SQL:

update table_of_records_to_process
set processed = 2
where record_id = 123456
and processed = 0

...that way, if some other process has already grabbed that record ID, your update will not set the processed field to 2. You'll want to verify that record ID 123456 is truly set to 2:

select count(*)
from table_of_records_to_process
where record_id = 123456
and processed = 2

...then you can process that one. If the count returned is zero, move on to the next record in your collection and try again. If you get to the end of your collection and some other process has already modified all those records, go fetch N more records.
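
One refinement worth sketching: because the update already includes "and processed = 0", its own row count tells you whether this process won the claim; the follow-up select would also return 1 if a competitor had set the flag. A hedged sketch using @@ROWCOUNT:

update table_of_records_to_process
set processed = 2
where record_id = 123456
and processed = 0

-- @@ROWCOUNT is 1 only if this statement changed the row;
-- 0 means another process claimed it first
if @@ROWCOUNT = 1
    print 'claimed 123456'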

Nick DeVore