I'm working on a document management system. An example workflow would be something like this:
- A document is emailed to the system
- The system does a number of preparatory actions to the document
- Document is presented to a user for further processing
- Afterwards, document is sent to Quality Assurance
- Afterwards, the system does a number or post-processing actions to the document
- Document is considered completely processed and disseminated (e.g. emailed back to whoever emailed the document to the system, etc.)
Since the volume of my input will vary (but will usually be high volume), I am very concerend about scalability.
For example, say the system has already downloaded the email attachments. If the attachments are PDF documents, the system needs to split the PDF into individual pages, then convert each page into multiple size thumbnails, etc. I plan to have a cron job check (say, every minute) to see if there are an PDF documents that need to be processed. Using a flagging system (e.g. "PDF Document Ready to be Processed"), I can check the database for all PDF documents that are flagged to be processed. Once the PDF processing is done, the flag can be updated to say "PDF Processing Done."
However, since the processing of each PDF document is very time consuming, I am concerned that when the next cron job is executed, that cron job will also try to process the PDFs that the previous cron job is still processing.
A possible solution is to immediately flag the PDF documents with "PDF Document Currently Being Processed." That way, when the next cron job is executed, it will exclude the ones already being processed.
Thus, each step in the workflow will probably have 3 flags:
- PDF Document Ready to be Processed
- PDF Document Currently Being Processed
- PDF Processing Done
Same for QA:
- Document Ready for QA
- Document Currently Being QAd
- Document QA Done
Is this a good approach? Is there a better approach? Would I have these flags as a single column of the "PDF Document" table in the database? Or should the flags be its own table (e.g. especially if a document can have multiple flags set).
I'd like to solicit suggestions on how to implement such a system.