Processing a database queue across multiple threads - design advice

views:

135

answers:

+2 Q:

Processing a database queue across multiple threads - design advice

I have a SQL Server table full of orders that my program needs to "follow up" on (call a webservice to see if something has been done with them). My application is multi-threaded, and could have instances running on multiple servers. Currently, every so often (on a Threading timer), the process selects 100 rows, at random (ORDER BY NEWID()), from the list of "unconfirmed" orders and checks them, marking off any that come back successfully.

The problem is that there's a lot of overlap between the threads, and between the different processes, and their's no guarantee that a new order will get checked any time soon. Also, some orders will never be "confirmed" and are dead, which means that they get in the way of orders that need to be confirmed, slowing the process down if I keep selecting them over and over.

What I'd prefer is that all outstanding orders get checked, systematically. I can think of two easy ways do this:

The application fetches one order to check at a time, passing in the last order it checked as a parameter, and SQL Server hands back the next order that's unconfirmed. More database calls, but this ensures that every order is checked in a reasonable timeframe. However, different servers may re-check the same order in succession, needlessly.
The SQL Server keeps track of the last order it asked a process to check up on, maybe in a table, and gives a unique order to every request, incrementing its counter. This involves storing the last order somewhere in SQL, which I wanted to avoid, but it also ensures that threads won't needlessly check the same orders at the same time

Are there any other ideas I'm missing? Does this even make sense? Let me know if I need some clarification.

RESULT:

What I ended up doing was adding a LastCheckedForConfirmation column to my table with finished orders in it, and I added a stored procedure that updates a single, Unconfirmed row with GETDATE() and kicks out the order number so my process can check on it. It spins up as many of these as it can (given the number of threads the process is willing to run), and uses the stored procedure to get a new OrderNumber for each thread.

To handle the "Don't try rows too many times or when they're too old" problem, I did this: The SP will only return a row if "Time since last try" > "Time between creation and last try", so each time it will take twice as long before it tries again - first it waits 5 seconds, then 10, then 20, 40, 80, 120, and then after it's tried 15 times (6 hours), it gives up on that order and the SP will never return it again.

Thanks for the help, everybody - I knew the way I was doing it was less than ideal, and I appreciate your pointers in the right direction.

+1 A:

The obvious way would be to add a column LastCheckDt to the order. In each thread, retrieve the order that has gone for the longest time without checking. In the procedure that retrieves the order, update the LastCheckDt field.

I wouldn't retrieve 100 orders at once, there is a risk of the 50th order changing in the database before your thread reaches it. Get one order, and when done, get the next one.

In addition, I'd initially develop the process without multi-threading. Checking an open order is usually fast enough to be done sequentially.

Andomar 2010-05-27 15:48:09

The service was grabbing the next X orders and then spinning off threads to make the webservice call, which could take a few seconds to complete, and I didn't want to stall the service while it waits.

rwmnau 2010-05-27 16:15:03

One strategy you might want to Consider is a table like this;

JobID bigint PK not null, WorkerID int/nvarchar(max) null

Where worker is the id/name of the server that is processing it, or null if nobody has picked up the job. When a server picks up a job, it puts its own id/name into that column which indicates to others not to pick up the job.

One problem is that it is possible that the server working a job crashes, making the job never complete. You could add a date column that would represent the timeout, which is set when the worker picks up the job to now + some time span that you decide is appropriate.

EDIT: Forgot to mention, you will either need to delete to job when it is complete, or have a status field to indicate completion. An additional field could indicate parameters for the job to make your job table generic: ie. don't just make a solution for your orders, make a job manager that can process anything you will need in the future.

csauve 2010-05-27 15:52:13

That is basically what I was trying to type when I got notified of a new answer. The only other thing I was going to say was to only pick one record at a time, and pick them (WorkerID NOT NULL) in order of Order creation date (ie oldest gets processed first).

kevinw 2010-05-27 15:57:11

+2 A:

I recommend read and internalize Using tables as Queues.

If you use the data as a queue, you must organize it properly for queuing operations. The article I linked goes into details about how to do this, what you have is a variant of a Pending Queue.

One thing you must absolutely get rid of is the randomness. If there is one thing that is hard to reproduce in a query, is randomness. ORDER BY NEWID() will scan every row, generate a guid, then SORT, and then give you back top 100. You cannot, under any circumstances, have every worker thread scan the entire table every time, you'll kill the server as the number of unprocessed entries grows.

Instead use pending processing date. Have the queue be organized (clustered) by processing date column (when the item is due for retry) and dequeue using the techniques I show in my linked article. If you want to retry, the dequeue should postpone the item instead of deleting it, ie. WITH (...) UPDATE SET due_date = dateadd(day, 1, getutcdate()) ...

Remus Rusanu 2010-05-27 15:59:55

Very nice article indeed. Only problem I see with it might be that it is using OUTPUT, which will remove the row. If the worker process crashes/loses power, the work will be totally lost. There are ways around this, but I'd prefer to put the solution in the same table.

csauve 2010-05-27 16:05:38

I agree that the ORDER BY NEWID() is a terrible idea long-term - it was just a quick way to get it done, and I've realized that it needs to be changed before the application goes into production. Thanks for your suggestions - I'll see if there's an easy way to attempt a "last confirmation check" column.

rwmnau 2010-05-27 16:14:01

This is nice article. However, it does not give u 100% reliability and there is still a chance when same message will be processed twice or more times. The only way to deal with this - have a central broker service, that enqueues and dequeues messages. This way only one thread updates and accesses queue data. This may be optimized by fetching sets of messages instead of one message at a time.

IMHO 2010-05-27 18:16:37

@csauve: you don't have to commit the dequeue. You can begin transaction, dequeue, process and only then commit. If you follow the guidance of the article I linked then the good think is that dequeues don't block each other, one thread can dequeue and not commit w/o blocking other processing threads. If the processing is *really* long (minutes) then the DELETE ... WITH OUTPUT can be replaced with an UPDATE ... WITH OUTPUT and crashed processors can be reclaimed.

Remus Rusanu 2010-05-27 18:18:50

@IMHO: can you explain how exactly can a message be processed twice?

Remus Rusanu 2010-05-27 18:19:55

let me put it this way - it's either 2 messages or very hot spot in the db. Either is not acceptable in my book. I'd prefer to serialize rather then locking the queue table and relying on RDBMS implementation of clustered index/rowlock. This is why there are separate messaging services specifically designed for this purpose. When such solution is not available from financial or technology limitation point of view, I'd fall back to serialized approach using one of your designs for queue tables.

IMHO 2010-05-27 20:56:45

@IMHO: High scale/High throughput enqueue/dequeue semantics for messaging in the database *is* possible. Read all about it at http://msdn.microsoft.com/en-us/library/ms166043%28SQL.90%29.aspx. And since I happen to be one of the engineers that implemented it I can tell you that it does **not** create a hot spot in the database.

Remus Rusanu 2010-05-27 21:22:21

I agree the Service Broker is reliable solution, but you yourself advise against it due to difficulty of implementation and other reasons.

IMHO 2010-05-28 13:13:52

@IMHO: the table based Queues in my article though are not difficult to implement. SSB's own queues are based on exactly the same concepts as the tables in my article; proper clustered keys to facilitate dequeue/enqueue operations, use of OUTPUT clause for atomic read-write operations, READPAST hint for concurrency amongst readers.

Remus Rusanu 2010-05-28 15:27:12

ansaurus

tags:

views:

answers:

Processing a database queue across multiple threads - design advice

related questions