I'm writing a background service that needs to process a series of jobs, stored as records in a SQL Server table. The service needs to find the oldest 20 jobs that need to be worked (where status = 'new'), mark them (set status = 'processing'), run them, and update the jobs afterward.

It's the first part I need help with. There could be multiple threads accessing the database at the same time, and I want to make sure that the "mark & return" query runs atomically, or nearly so.

This service will be spending comparatively little time accessing the database, and it's not the end of the world if a job gets run twice, so I might be able to accept a small probability of jobs running more than once for increased simplicity in the code.

What is the best way to do this? I'm using LINQ to SQL for my data layer, but I assume I'll have to drop down into T-SQL for this.

+7  A: 

Please see my answer here: SQL Server Process Queue Race Condition which also manages 20 rows in one go.

Basically, it's quite simple in SQL Server to manage concurrency and polling using the hints ROWLOCK, READPAST and UPDLOCK.

I can't comment on LINQ, but a transaction alone still leaves you open to concurrency issues: you need to use the hints I mentioned.
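As a minimal sketch of that pattern (the table and `enqueue_time` column names are assumed from the question, not from the linked answer):

```
-- Sketch only: select and mark the oldest 20 'new' jobs inside one
-- transaction. READPAST skips rows already locked by other workers,
-- UPDLOCK holds an update lock on selected rows until commit, and
-- ROWLOCK keeps the locking granularity at the row level.
BEGIN TRAN;

SELECT TOP (20) id
FROM your_table WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE status = 'new'
ORDER BY enqueue_time;

-- ...set status = 'processing' on the selected rows here...

COMMIT;
```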

gbn
+1  A: 

I know it's off topic, but for this you could use MSMQ. A message queue would put your jobs in sequence, and it's thread safe. You can also assign priorities, which MSMQ manages itself. You can use Receive or Peek to remove a message from the queue or just see what's there. You can use the command design pattern to help you with this.

Stéphane
Queueing is the answer, but why MSMQ when SQL Server comes with built in queues?
Remus Rusanu
The way I use them is to control processes. When I queue something I don't use the database at all, so any listener can get a job to do. I tested it with 5 computers running 10 processes each and I never got a concurrency problem. I guess it depends where you want your queue to reside.
Stéphane
A: 

Is it not just as simple as running your T-SQL within a transaction, or am I missing something?

birdus
+4  A: 

Building on gbn's answer...

If you're using SQL Server 2005 or newer, you can return the updated rows atomically by using an OUTPUT clause in your UPDATE statement:

UPDATE TOP (20) your_table
SET status = 'processing'
OUTPUT INSERTED.*
FROM your_table WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE status = 'new'

LukeH
+6  A: 

Your table of jobs is a queue. Writing user-table-backed queues is notoriously error prone, as it leads to deadlocks and concurrency issues.

The simplest thing would be to drop the user table and use a true queue instead. This will give you a deadlock-free, concurrency-safe queue on a system-tested and validated code base. The problem is that the whole paradigm around queues changes from INSERT and DELETE/UPDATE to SEND/RECEIVE. On the other hand, with the built-in queues you get some very powerful free goodies, namely Activation and correlated-item locking.
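For illustration only (the queue name `JobQueue` and the selected columns are hypothetical), dequeuing a batch from a Service Broker queue replaces the UPDATE with a RECEIVE:

```
-- Sketch: pull up to 20 messages from a Service Broker queue,
-- waiting up to 5 seconds for work to arrive. RECEIVE removes the
-- messages atomically, so competing readers never see the same item.
WAITFOR (
    RECEIVE TOP (20)
        conversation_handle,
        message_type_name,
        message_body
    FROM JobQueue
), TIMEOUT 5000;
```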

If you want to continue down the path of user-table-backed queues, then the second most important trick in writing them is to use UPDATE ... OUTPUT:

WITH cte AS (
  SELECT TOP(20) status, id, ...
  FROM your_table WITH (ROWLOCK, READPAST, UPDLOCK)
  WHERE status = 'new'
  ORDER BY enqueue_time)
UPDATE cte
  SET status = 'processing'
OUTPUT
  INSERTED.id, ...

The CTE syntax is just for the convenience of placing the TOP and ORDER BY properly; the query can be written using derived tables just as easily. You cannot use a straight UPDATE ... TOP because UPDATE does not support ORDER BY, and you need it to satisfy the 'oldest' part of your requirement. The lock hints are needed to facilitate high concurrency between parallel processing threads.

I said this is the second most important trick. The most important is how you organize the table: for a queue it must be clustered by (status, enqueue_time). If you don't organize the table properly you'll end up with deadlocks. Pre-emptive comment: fragmentation is irrelevant in this scenario.
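A sketch of that table organization (column names beyond `status` are assumptions based on the question; the nonclustered primary key keeps `id` unique while the clustered index serves the dequeue scan):

```
-- Hypothetical job-queue table, clustered on (status, enqueue_time)
-- so the dequeue query touches only the 'new' rows, in age order.
CREATE TABLE your_table (
    id            INT IDENTITY  NOT NULL,
    status        VARCHAR(20)   NOT NULL DEFAULT 'new',
    enqueue_time  DATETIME      NOT NULL DEFAULT GETDATE(),
    payload       NVARCHAR(MAX) NULL,
    CONSTRAINT PK_your_table PRIMARY KEY NONCLUSTERED (id)
);

CREATE CLUSTERED INDEX CIX_your_table_queue
    ON your_table (status, enqueue_time);
```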

Remus Rusanu