Recently I had to deal with a problem that I imagine is pretty common: given a database table with a large (million+) number of rows to be processed, and multiple processors running on various machines/threads, how do you safely allow each processor instance to grab a chunk of work (say 100 items) without the instances interfering with one another?

I'm grabbing a chunk at a time for performance reasons: I don't want a round trip to the database for each item.

A: 

You can have a special table to queue work up, where the consumers delete (or mark) work as being handled, or use a middleware queuing solution, like MSMQ or ActiveMQ.

Middleware comes with its own set of problems, so if possible I'd stick with a special table. Keep it as small as possible, ideally just an id, so the workers can fetch the rest of the information from the main tables themselves and don't lock the queue table for too long.

You'd fill this table up at regular intervals and let processors grab what they need from the top.
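
For illustration, a minimal sketch of the delete-on-claim variant, assuming SQL Server and a hypothetical WorkQueue table holding just the queued ids (all names here are illustrative):

-- each consumer atomically claims and removes up to 100 queued ids;
-- READPAST lets concurrent consumers skip rows already locked by others
DELETE TOP (100)
FROM WorkQueue WITH (ROWLOCK, READPAST)
OUTPUT DELETED.ItemId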

Related questions on SQL table queues:

http://stackoverflow.com/questions/280515/queue-using-table

http://stackoverflow.com/questions/465692/working-out-the-sql-to-query-a-priority-queue-table

Related questions on queuing middleware:

http://stackoverflow.com/questions/798748/building-a-high-performance-and-automatically-backupped-queue

http://stackoverflow.com/questions/960630/messaging-platform

Vinko Vrsalovic
+1  A: 

There are a few approaches - you could associate a token with each processor, and have a stored procedure that sets that token against the next [n] available items; perhaps something like:

(note - needs suitable isolation-level; perhaps serializable: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE)

-- atomically claim up to 1000 unowned rows for this processor
-- and return their ids to the caller via the OUTPUT clause
UPDATE TOP (1000) WORK
SET [Owner] = @processor, Expiry = @expiry
OUTPUT INSERTED.Id -- etc
WHERE [Owner] IS NULL

You'd also want a timeout (@expiry) on this, so that when a processor goes down you don't lose work. You'd also need a task that clears the owner on items whose Expiry has passed.
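
A sketch of that reclaim task, under the same assumptions as the query above (table and column names are illustrative):

-- return expired claims to the pool so another processor can pick them up
UPDATE WORK
SET [Owner] = NULL, Expiry = NULL
WHERE [Owner] IS NOT NULL
  AND Expiry < GETUTCDATE()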

Marc Gravell
Thanks, Marc. I didn't realize you could use TOP in an UPDATE. I was trying something like UPDATE [Table] SET [Owner] = @processor WHERE ID IN (SELECT ID FROM (SELECT ROW_NUMBER() OVER (ORDER BY ID) ROWNUMBER, ID FROM [Table] WHERE [Owner] IS NULL) A WHERE ROWNUMBER <= 100), since it seems to be SQL-99 compliant.
Otávio Décio
If you are going to that complexity, SET ROWCOUNT 100 would be easier...
Marc Gravell
A: 

You didn't say which database server you're using, but there are a couple of options.

MySQL includes an extension to SQL99's UPDATE that limits the number of rows affected (the LIMIT clause). You can assign each worker a unique token, update a number of rows, then query to get that worker's batch. Marc showed the UPDATE TOP syntax, which is the SQL Server equivalent.
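
A minimal sketch of that pattern in MySQL, assuming a hypothetical work table with id and owner columns:

-- claim up to 100 unowned rows for this worker
UPDATE work
SET owner = 'worker-17'
WHERE owner IS NULL
LIMIT 100;

-- then fetch the claimed batch
SELECT id FROM work WHERE owner = 'worker-17';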

Another option is a table designated for locking. Don't use the data table itself, since you don't want to lock it for reading. The lock table likely needs only a single row, holding the next ID that needs work. A worker locks the table, reads the current ID, increments it by the batch size, writes it back, then releases the lock. It can then query the data table and pull the rows it reserved. This approach assumes the data table has a monotonically increasing ID, and it isn't very fault-tolerant if a worker dies or otherwise can't finish a batch.
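
A sketch of the claim step, assuming SQL Server and a hypothetical single-row LockTable with a NextId column (names illustrative); here the lock/read/increment/release steps collapse into a single atomic UPDATE:

-- atomically reserve a range of ids; in T-SQL, column references on the
-- right-hand side see the pre-update value, so @first captures the start
-- of this worker's range
DECLARE @first INT, @batch INT = 100;

UPDATE LockTable
SET @first = NextId,
    NextId = NextId + @batch;

-- the worker now owns ids @first .. @first + @batch - 1
SELECT * FROM DataTable
WHERE Id >= @first AND Id < @first + @batch;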

Steve Madsen
A: 

Quite similar to this question: SQL Server Process Queue Race Condition

You run a query to assign 100 rows to a given ProcessorID. With those locking hints it's "safe" in the concurrency sense, and it's a single SQL statement with no SET statements needed.

This is taken from the other question:

UPDATE TOP (100)
    foo
SET
    ProcessorID = @PROCID
FROM
    -- READPAST skips rows locked by other workers, UPDLOCK holds the
    -- claimed rows for this transaction, ROWLOCK keeps locks granular
    OrderTable foo WITH (ROWLOCK, READPAST, UPDLOCK)
WHERE
    ProcessorID = 0    -- or whatever "unassigned" is
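
Each worker can then fetch its claimed batch with a plain SELECT (same illustrative names as above):

SELECT * FROM OrderTable WHERE ProcessorID = @PROCID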
gbn