I'm working on a document management system. An example workflow would be something like this:

  1. A document is emailed to the system
  2. The system does a number of preparatory actions to the document
  3. Document is presented to a user for further processing
  4. Afterwards, document is sent to Quality Assurance
  5. Afterwards, the system does a number of post-processing actions to the document
  6. Document is considered completely processed and disseminated (e.g. emailed back to whoever emailed the document to the system, etc.)

Since the volume of my input will vary (but will usually be high), I am very concerned about scalability.

For example, say the system has already downloaded the email attachments. If the attachments are PDF documents, the system needs to split each PDF into individual pages, then convert each page into thumbnails of multiple sizes, etc. I plan to have a cron job check (say, every minute) whether there are any PDF documents that need to be processed. Using a flagging system (e.g. "PDF Document Ready to be Processed"), I can check the database for all PDF documents that are flagged to be processed. Once the PDF processing is done, the flag can be updated to say "PDF Processing Done."

However, since the processing of each PDF document is very time consuming, I am concerned that when the next cron job is executed, that cron job will also try to process the PDFs that the previous cron job is still processing.

A possible solution is to immediately flag the PDF documents with "PDF Document Currently Being Processed." That way, when the next cron job is executed, it will exclude the ones already being processed.

Thus, each step in the workflow will probably have 3 flags:

  1. PDF Document Ready to be Processed
  2. PDF Document Currently Being Processed
  3. PDF Processing Done

Same for QA:

  1. Document Ready for QA
  2. Document Currently Being QAd
  3. Document QA Done

Is this a good approach? Is there a better approach? Would I have these flags as a single column of the "PDF Document" table in the database? Or should the flags live in their own table (especially if a document can have multiple flags set)?

I'd like to solicit suggestions on how to implement such a system.

A: 

The solution kind of depends on what technologies you are using to implement this system. Is the pre- and post-processing done by the same software and language as the emailing software? Additionally, are they running in separate processes?

If you have distributed components, you could do much worse than investigating an AMQP solution like RabbitMQ, as this takes care of putting each job into a queue and making sure that only one of your consumers takes each job. (You would model each thumbnailing job as an individual task.)

If, however, the entire system is implemented in one language and runs inside a single process, there are some simpler options you can use:

  • Resque is a good solution for Ruby
  • In Java, a LinkedBlockingQueue would work well
  • Uh, I'm sure C# will have some way of creating a queue of jobs (disclaimer: I know nothing of C#)
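
For a rough idea of the single-process case, here is a minimal sketch in Python (not one of the languages listed above, chosen only for brevity) using the standard library's `queue.Queue`. The names `process_pages` and `make_thumbnail` are hypothetical, and the thumbnailing is a placeholder string rather than real image work.

```python
import queue
import threading


def make_thumbnail(page):
    # Hypothetical placeholder: a real system would render an image here.
    return f"thumb-of-{page}"


def process_pages(pages, n_workers=3):
    """Run each page through make_thumbnail exactly once using worker threads."""
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            page = jobs.get()      # blocks until a job is available
            if page is None:       # sentinel: no more work for this worker
                break
            thumb = make_thumbnail(page)
            with lock:
                results.append(thumb)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for page in pages:
        jobs.put(page)
    for _ in threads:
        jobs.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)
```

Because every worker pulls from the same queue, each page is handed to exactly one worker, which is the property the cron-plus-flags approach has to enforce by hand.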
Ceilingfish
I'm implementing on WAMP/LAMP, but looking more for a technology-agnostic solution.
StackOverflowNewbie
So in that case RabbitMQ should still be an option for you, as AMQP is system-agnostic (it's designed to allow different technologies to talk to each other in a uniform way). Alternatively, there is a specific queue implementation for PHP in Zend Server (which I think costs a fair bit of money), or there's this PHP port of Resque, http://github.com/chrisboulton/php-resque, which plugs into PHP code.
Ceilingfish
A: 

To address your concern about concurrent processing of the same document, there are many scheduler packages that can help you manage this aspect. http://www.quartz-scheduler.org/ is one I've used with great success.

To address your problem, I'd have three states: received, queued, and processed (similar to what you suggest).
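
As a small illustration, the three states fit in a single status value per document rather than several independent flags. The sketch below is in Python; `DocState`, `ALLOWED`, and `advance` are made-up names for illustration, and the transition table is my assumption about the intended order.

```python
from enum import Enum


class DocState(Enum):
    RECEIVED = "received"
    QUEUED = "queued"
    PROCESSED = "processed"


# Assumed legal transitions: received -> queued -> processed.
ALLOWED = {
    DocState.RECEIVED: {DocState.QUEUED},
    DocState.QUEUED: {DocState.PROCESSED},
    DocState.PROCESSED: set(),
}


def advance(current, target):
    """Return the new state, or raise if the transition is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Storing `DocState.value` in one column keeps a document from being, say, both received and processed at once.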

I'd have a scheduled recurring job that polls the database for received PDFs and, for each one, queues a processing job and marks the PDF as queued. If you ensure both steps happen in the same transaction and use optimistic locking, there is no risk that another job could come along and re-read the PDF as received.

Quartz uses a thread pool with many configuration options, and is great for deferred, resource-intensive processing (I use it for image thumbnailing in a server setting).

To take a step back, there are some great workflow packages in the Java world that can handle most of what you want to do, including the deferred PDF processing. Take a look at jBPM or Drools Flow; these are two great, if complex, packages.

Taylor