views: 892
answers: 4

I have a database table with ~50K rows in it; each row represents a job that needs to be done. I have a program that extracts a job from the DB, does the job, and puts the result back in the DB. (This system is running right now.)

Now I want to allow more than one processing task to do jobs, but be sure that no job is done twice (as a performance concern, not because it would cause other problems). Because the access is by way of a sproc, my current thought is to replace said sproc with something that looks something like this:

update tbl set owner=connection_id() where available and owner is null limit 1;
select stuff from tbl where owner = connection_id();
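
For concreteness, wrapped up as a procedure it might look roughly like this (just a sketch, assuming MySQL, where connection_id() is available; claim_next_job and the table/column names are illustrative):

delimiter //
create procedure claim_next_job()
begin
  -- claim at most one available, unowned job for this connection
  update tbl set owner = connection_id()
    where available and owner is null
    limit 1;
  -- return whatever was claimed (empty result set if nothing was left)
  select stuff from tbl where owner = connection_id();
end //
delimiter ;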

BTW: worker tasks might drop their connection between getting a job and submitting the results. Also, I don't expect the DB to even come close to being the bottleneck unless I mess that part up (~5 jobs per minute).

Are there any issues with this? Is there a better way to do this?

Note: the "Database as an IPC anti-pattern" is only slightly apropos here because 1) I'm not doing IPC (there is no process generating the rows, they all already exist right now) and 2) the primary gripe described for that anti-pattern is that it results in unneeded load on the DB as processes wait for messages (in my case, if there are no messages, everything can shutdown as everything is done)

A: 

You are trying to implement the "Database as IPC" antipattern. Look it up to understand why you should consider redesigning your software properly.

Krunch
How do you know it's an antipattern in this case, or that the software design is improper? You don't have any context on which to base this comment whatsoever.
Greg Beech
A quick Google does not find anything on "Database as IPC".
Nathan Lee
I'd call it a useful pattern for asynchronous IPC. You can configure it to operate like any garden-variety message queue, and those aren't, in my experience, branded "antipatterns".
le dorfier
Here's a reference to the antipattern: http://tripatlas.com/Database_as_an_IPC The difference is that we're discussing using the database as a message queue, not as a mechanism for processes to interoperate.
le dorfier
A: 

Instead of having owner = NULL when a row isn't owned, set it to a fake "nobody" value. Searching for NULL may not use the index, so you might end up with a table scan. (This is for Oracle; SQL Server might be different.)
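
A rough illustration of that idea, reusing the question's statements (assuming a reserved sentinel value such as 0 standing in for "nobody"; purely a sketch):

update tbl set owner = connection_id() where available and owner = 0 limit 1;
select stuff from tbl where owner = connection_id();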

Nathan Lee
+4  A: 

Here's what I've used successfully in the past:

MsgQueue table schema

MsgId        identity      -- NOT NULL
MsgTypeCode  varchar(20)   -- NOT NULL
SourceCode   varchar(20)   -- process inserting the message -- NULLable
State        char(1)       -- 'N'ew if queued, 'A'ctive if processing, 'C'ompleted; default 'N' -- NOT NULL
CreateTime   datetime      -- default GETDATE() -- NOT NULL
Msg          varchar(255)  -- NULLable
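
In T-SQL that might be declared roughly as follows (a sketch based on the listing above; the identity type and primary key choice are assumptions):

create table MsgQueue (
    MsgId       int identity(1,1) primary key,
    MsgTypeCode varchar(20)  not null,
    SourceCode  varchar(20)  null,                      -- process inserting the message
    State       char(1)      not null default 'N',      -- 'N'ew, 'A'ctive, 'C'ompleted
    CreateTime  datetime     not null default getdate(),
    Msg         varchar(255) null
);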

Your message types are what you'd expect - messages that conform to a contract between the process(es) inserting and the process(es) reading, structured with XML or your other choice of representation (JSON would be handy in some cases, for instance).

Then 0-to-n processes can be inserting, and 0-to-n processes can be reading and processing the messages. Each reading process typically handles a single message type. Multiple instances of a process type can be running for load-balancing.

The reader pulls one message and changes the state to "A"ctive while it works on it. When it's done it changes the state to "C"omplete. It can delete the message or not depending on whether you want to keep the audit trail. Messages of State = 'N' are pulled in MsgType/Timestamp order, so there's an index on MsgType + State + CreateTime.
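
That index might be created like this (assuming the MsgTypeCode column name from the schema; the index name is illustrative):

create index IX_MsgQueue_Pull on MsgQueue (MsgTypeCode, State, CreateTime);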

Variations:
State for "E"rror.
Column for Reader process code.
Timestamps for state transitions.

This has provided a nice, scalable, visible, simple mechanism for doing a number of things like you are describing. If you have a basic understanding of databases, it's pretty foolproof and extensible.

le dorfier
The part described as "The reader pulls one message and changes the state to 'A'ctive while it works on it" is the part I'm interested in. How do you do that bit? (Aside from that, it looks like mine is the same as yours, without the stuff that isn't needed for my case.)
BCS
Right, that requires multiple SQL statements between BEGIN TRAN and COMMIT TRAN. Immediately following is an SP for pulling the next message, hacked up a bit; I've omitted error trapping since it was written pre-TRY/CATCH.
le dorfier
-- PART 1
CREATE PROCEDURE GetMessage @MsgType VARCHAR(8) AS
DECLARE @MsgId INT
BEGIN TRAN
SELECT TOP 1 @MsgId = MsgId
FROM MsgQueue
WHERE MsgTypeCode = @MsgType AND State = 'N'
ORDER BY CreateTime
le dorfier
-- PART 2
IF @MsgId IS NOT NULL
BEGIN
    UPDATE MsgQueue SET State = 'A' WHERE MsgId = @MsgId
    SELECT MsgId, Msg FROM MsgQueue WHERE MsgId = @MsgId
END
ELSE
BEGIN
    SELECT MsgId = NULL, Msg = NULL
END
COMMIT TRAN
le dorfier
What if I have to select more than one row at a time? Can I update them all at the same time?
Amitd
Assuming you mark them with a common timestamp, or selection-batch id, you can update them all in a single statement, yes. Or use the "A" state described above, and update where state = 'A'.
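
For example, the batch-id variant might look roughly like this in T-SQL (a sketch; it assumes a BatchId column has been added, and @MsgType, @BatchId, and @BatchSize are illustrative parameters):

update top (@BatchSize) MsgQueue
    set State = 'A', BatchId = @BatchId
    where MsgTypeCode = @MsgType and State = 'N';
select MsgId, Msg from MsgQueue where BatchId = @BatchId;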
le dorfier
A: 

Just as a possible technology change, you might consider using MSMQ or something similar.

Each of your jobs / threads could query the message queue to see if a new job was available. Because the act of reading a message removes it from the queue, you are assured that only one job / thread would get the message.

Of course, this is assuming you are working with a Microsoft platform.

Chris Lively
I have the data in the DB, and when I'm done I need the data in the DB. In my case I see no reason to add another component to the system. (BTW http://www.microsoft.com/windowsserver2003/technologies/msmq/default.mspx)
BCS