views: 183
answers: 2

At scheduled times (based on business rules), we need anywhere between 10,000 - 200,000 rows from a database to be processed by a long running operation. Each row needs to be processed individually (they don't depend on each other) and it is ok for this to happen asynchronously. The successful completion of the operation must be documented (probably to a database).

I presume the best way to do this would be using a message queue, so that it is durable and can easily scale up. Firstly, is this the best solution? If so, what is the most efficient way to put all the rows from the database onto the queue, guaranteeing that all the rows have gone to the queue?

Note: We are a .NET team using C# 3.5, WCF, MSMQ and SQL 2005. I have investigated NServiceBus and would be willing to use this if recommended.

I am worried about the database being the bottleneck (updating each row to mark as processed). I also don't know how to "transactionally send" a message.

  1. How do I "transactionally send" a message? What I mean by this is: load a row from db, send message, update row in db. If the update fails, I don't want the message to send.
  2. Is this a common scenario or should I be doing it a different way?
  3. My worry is that updating individual rows in the db will cause a bottleneck. Can I "transactionally send" a batch of messages, then batch update the db?
+2  A: 

Yes. Message queues are designed for exactly this kind of thing. That's what they're for. The message queue solution can scale to astronomical size. You'll run out of processing capability before the queues themselves are a limiting factor.

Generally, you want concurrent, separate processes to run these things. Don't waste a single brain calorie on a threaded solution. You want as many resources as you can apply, and OS-level allocation of resources is best for this kind of thing.

You want the simplest way to do the following.

  1. Create the Queue.

  2. Create several consumer processes that are all reading from the queue.

  3. Start the producer process. This will execute your query and write to the queue, doing as little as possible along the way.

Each consumer process will be competing for an entry in the queue. Think of an ice-hockey face-off among idle consumers. Once a consumer grabs its row, it can do with it as it pleases.

When the query runs dry, you have to shut everything down. It's hard (but not impossible) to shut the consumers down cleanly. It's common to have a special "all done" message that you put into the queue after the query finishes. If you have n consumers, you put n copies of this all-done message into the queue so that the consumers can all shut down cleanly.

Note that this looks a lot like a Unix pipeline. For a very good reason.
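A minimal sketch of that flow in C# with System.Messaging, under a couple of assumptions not in the question: a private queue named `rowwork`, an integer row id as the message body, and a `dbo.RowsToProcess` table as the source of the query.

```csharp
// Sketch only: queue path, table, and payload shape are illustrative assumptions.
using System;
using System.Data.SqlClient;
using System.Messaging;

static class WorkQueue
{
    public const string Path = @".\private$\rowwork";  // assumed queue name
    public const int AllDone = -1;                      // sentinel meaning "no more rows"

    public static MessageQueue Open()
    {
        if (!MessageQueue.Exists(Path))
            MessageQueue.Create(Path, true);            // true = transactional (durable)
        MessageQueue q = new MessageQueue(Path);
        q.Formatter = new XmlMessageFormatter(new[] { typeof(int) });
        return q;
    }
}

static class Producer
{
    // Executes the query and writes to the queue, doing as little else as possible.
    public static void Run(string connectionString, int consumerCount)
    {
        using (MessageQueue queue = WorkQueue.Open())
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand("SELECT Id FROM dbo.RowsToProcess", conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
                while (reader.Read())
                    queue.Send(reader.GetInt32(0), MessageQueueTransactionType.Single);

            // One "all done" message per consumer so each one can shut down cleanly.
            for (int i = 0; i < consumerCount; i++)
                queue.Send(WorkQueue.AllDone, MessageQueueTransactionType.Single);
        }
    }
}

static class Consumer
{
    // Competes with the other consumer processes for messages; exits on the sentinel.
    public static void Run()
    {
        using (MessageQueue queue = WorkQueue.Open())
        {
            while (true)
            {
                Message msg = queue.Receive(MessageQueueTransactionType.Single);
                int rowId = (int)msg.Body;
                if (rowId == WorkQueue.AllDone) break;
                ProcessRow(rowId);                      // the long-running operation
            }
        }
    }

    static void ProcessRow(int rowId) { /* ... */ }
}
```

Every consumer instance runs the same `Consumer.Run` loop; MSMQ delivers any given message to exactly one of them, which is the face-off described above.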


Edit.

  1. How do I "transactionally send" a message? What I mean by this is: load a row from db, send message, update row in db. If the update fails, I don't want the message to send.

    Do not do this. It's a mess. Your producer sends the messages. The queue itself is perfectly reliable. Messages do not "get lost" or "somehow" fail to process. The queue can be configured to work with the file system so that the messages are persistent until consumed.

    The consumers can mark things as "done" when they're done with them. This will slow things down a lot.

    If you want to keep "processing" status, use a data warehousing technique and don't keep it with your transactional data.

  2. Yes, this is a common scenario. Everyone tries to do too much on the producer side of this sort of thing. You rarely need all the "transactional" processing you think you need. You have to detail the end-user requirements for this. Do you really need that update? Or is it just there because it seems like you should record the processing status somewhere?

    Remember queues are very reliable. Don't reinvent a persistent queue in the database.

    Read http://www.microsoft.com/windowsserver2003/techinfo/overview/msmqfaq.mspx on "transactional messaging". You have lots of configuration options to guarantee the message enters the queue (a sketch of a transactional send appears at the end of this answer).

  3. My worry is that updating individual rows in the db will cause a bottleneck.

    Good thing to think about. So don't do it. The question in database updates is always "why?" If it's "for completeness" that's no reason at all. If it's "for recovery" or "to prevent re-processing", you might want to think of a better design.

    Doing inserts into a log is much, much faster and still allows you to do a where-not-exists query to locate unprocessed rows in the very unlikely event of having to recover a partially processed history; a sketch of this appears below.
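A rough illustration of that log-insert idea, assuming hypothetical `dbo.RowsToProcess` and `dbo.ProcessedLog` tables:

```csharp
// Sketch only: table and column names are illustrative assumptions.
using System.Data.SqlClient;

static class CompletionLog
{
    // Consumers append a row here when they finish; inserts are cheap compared to per-row updates.
    public static void MarkDone(string connectionString, int rowId)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand(
            "INSERT INTO dbo.ProcessedLog (RowId, ProcessedAt) VALUES (@id, GETUTCDATE())", conn))
        {
            cmd.Parameters.AddWithValue("@id", rowId);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    // Recovery/reporting: rows with no log entry are the unprocessed ones (the where-not-exists query).
    public const string UnprocessedQuery = @"
        SELECT r.Id
        FROM dbo.RowsToProcess r
        WHERE NOT EXISTS (SELECT 1 FROM dbo.ProcessedLog p WHERE p.RowId = r.Id)";
}
```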

The simplest queue is the best. You rarely need a lot of transactional funny-business. Just queue the messages up and do the work on the consumer side.
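For reference on points 1 and 2: if the row update really does turn out to be a requirement (see the comments below), a transactional MSMQ queue can enlist in an ambient System.Transactions transaction together with SQL Server, so the UPDATE and the Send commit or roll back as a unit (this escalates to the DTC). A sketch under those assumptions, with the queue, table, and column names invented for illustration:

```csharp
// Sketch only: requires a *transactional* queue and the Distributed Transaction Coordinator.
using System.Data.SqlClient;
using System.Messaging;
using System.Transactions;

static class TransactionalSend
{
    public static void SendRow(string connectionString, int rowId)
    {
        using (TransactionScope scope = new TransactionScope())
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (MessageQueue queue = new MessageQueue(@".\private$\rowwork"))
        {
            conn.Open();  // enlists in the ambient transaction

            using (SqlCommand cmd = new SqlCommand(
                "UPDATE dbo.RowsToProcess SET Status = 'Queued' WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", rowId);
                cmd.ExecuteNonQuery();
            }

            // Automatic = participate in the ambient System.Transactions transaction.
            queue.Send(rowId, MessageQueueTransactionType.Automatic);

            scope.Complete();  // the UPDATE and the Send commit together, or neither does
        }
    }
}
```

The same pattern speaks to question 3: put a batch of sends and one set-based UPDATE inside a single scope, trading per-row round trips for batch-sized units of work.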

S.Lott
Why shut the consumers down at all? The process appears to be periodic; let the consumers go idle and wait for more data in the queue. If you use a timeout on the Receive, you can have them periodically drop out of the blocked state and check whether they have received a control message to shut down.
GrayWizardx
@GrayWizardx: Good point. We're Linux, and closing the source at the head of the pipeline propagates a simple EOF to the consumers and it all closes down nicely. I was projecting that close-down operation. If it's not needed, then save the complexity.
S.Lott
Thanks, this is also great info. If either of you could help with my further points, that would be great. I'm concerned about a bottleneck at the db and how to transactionally send messages.
JontyMC
This is nearer to what I'm looking for. The point of this question is that it is a requirement that all rows must be processed and logged in a timely fashion, and this must be scalable. This logging somehow must make it back to the db, as the UI needs to report on this (although this doesn't need to be real time). As I see it, I can either update the db when I send to the queue (if so, do I batch, and if I batch, how do I ensure all rows were indeed sent?), or I can update on the processing side (pretty much the same question - how to avoid individual row updates and still ensure data integrity).
JontyMC
@jontymc: You cannot avoid row updates if row updates are a requirement. I don't get the hand-wringing. You can use a fetch loop with an `UPDATE ... WHERE CURRENT OF` -- which updates the row being fetched. Or you can have the consumers do the row update.
S.Lott
+2  A: 

NServiceBus will make the process of setting up the queues much easier. This (the MSMQ design) is a common pattern to use for this operation, but it is not your only option.

You can also look at SQL Server Service Broker and many other similar technologies to do this same thing.

There are a couple of caveats you should be aware of with MSMQ:

  1. Transactional queues cannot be load balanced unless they are an Active Directory domain queue. The big takeaway here is that the queue must be on a single machine, which means that it is susceptible to loss if the machine is lost (either permanently or temporarily). This is not a huge concern, but one to take note of.
  2. MSMQ queues have two "modes": transactional and non-transactional. Transactional queues are the only ones that guarantee message delivery.
  3. MSMQ messages are themselves limited to 4MB (or so) natively, and you must manage serialization yourself (though default .NET serialization is pretty easy with the XML serializer). If you want messages larger than 4MB, you will need to either manage them outside of the queue or manage multiple queued messages yourself (BizTalk has a way of doing this, so it's not a huge issue). 4MB should be plenty big for your needs.
  4. Once you "accept" a message from the queue it is immediately removed, so depending on your design this might be an issue. It is possible for your consumers to "accept" a message, fail, and have the message not make it back to the queue (a transactional receive, sketched below, mitigates this).

Having said all of that, MSMQ is very reliable and stable, if you plan your implementation out and use it for the messaging part of your process, not the data storage part.
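On caveat 4: with a transactional queue, a receive performed inside a `MessageQueueTransaction` is not final until the transaction commits; if the consumer aborts, the message returns to the queue. A small sketch (queue, formatter, and processing code are assumptions):

```csharp
// Sketch only: assumes a transactional queue whose Formatter can deserialize an int body.
using System.Messaging;

static class SafeConsumer
{
    public static void HandleOne(MessageQueue queue)
    {
        using (MessageQueueTransaction tx = new MessageQueueTransaction())
        {
            tx.Begin();
            try
            {
                Message msg = queue.Receive(tx);  // invisible to other consumers while pending
                Process((int)msg.Body);           // the long-running work
                tx.Commit();                      // only now is the message really gone
            }
            catch
            {
                tx.Abort();                       // message goes back onto the queue
                throw;
            }
        }
    }

    static void Process(int rowId) { /* ... */ }
}
```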

Finally, as an alternative to your current proposal (and so you have something to compare to) you could implement your described scenario directly from the DB. As a napkin sketch:

  1. A process runs in the db and populates a table with "pending" rows to process, assigning each one a unique id (GUID, etc.)
  2. Create an SP which returns "n" of these rows to a caller and marks those same rows as "in progress" in the db. If there are no rows, it returns 0 or -1, or whatever
  3. Create an SP which receives a list of row ids and the disposition (completion info) for the job, and updates the pending table, either marking them done or removing them and logging the completion data
  4. Your consumers call into the first SP and request a set of rows to work on
  5. Your consumers process the rows
  6. Your consumers call the second SP to log the work done

You could then periodically run reports to see what work was done and what is still pending, and if needed reset stuck "in progress" rows back to pending, etc. This would have roughly the same scaling as your other solution, remove a layer of indirection (which may be a bad thing, depending), and provide a slightly more linear process. This process is essentially how Service Broker works (very distilled, of course).
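A consumer in that scheme might look roughly like this; the stored procedure names (`usp_GetPendingRows`, `usp_LogDisposition`), their parameters, and the delimited-id convention are all invented for illustration:

```csharp
// Sketch only: stored procedure names, parameters, and result shapes are assumptions.
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static class DbOnlyConsumer
{
    public static void Run(string connectionString, int batchSize)
    {
        while (true)
        {
            List<int> batch = GetPendingRows(connectionString, batchSize);
            if (batch.Count == 0) break;              // nothing left to claim

            foreach (int rowId in batch)
                ProcessRow(rowId);                    // the long-running operation

            LogDisposition(connectionString, batch);  // mark the whole batch done in one call
        }
    }

    static List<int> GetPendingRows(string connectionString, int batchSize)
    {
        List<int> ids = new List<int>();
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand("dbo.usp_GetPendingRows", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@batchSize", batchSize);
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
                while (reader.Read())
                    ids.Add(reader.GetInt32(0));
        }
        return ids;
    }

    static void LogDisposition(string connectionString, List<int> rowIds)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlCommand cmd = new SqlCommand("dbo.usp_LogDisposition", conn))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            // SQL 2005 has no table-valued parameters, so pass a delimited list (or use XML).
            cmd.Parameters.AddWithValue("@rowIds",
                string.Join(",", rowIds.ConvertAll(i => i.ToString()).ToArray()));
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }

    static void ProcessRow(int rowId) { /* ... */ }
}
```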

It all depends on how you feel most comfortable implementing this. I have done it both ways, and both have their pros and cons.

GrayWizardx
Thanks, that is great info. I've updated the question to clarify my problems.
JontyMC