views: 380
answers: 3
I am working on a Django application which allows a user to upload files. I need to perform some server-side processing on these files before sending them on to Amazon S3. After reading the responses to this question and this blog post, I decided that the best way to handle this is to have my view handler invoke a method on a Pyro remote object to perform the processing asynchronously and then immediately return an HTTP 200 to the client. I have this prototyped and it seems to work well; however, I would also like to store the state of the processing so that the client can poll the application to see whether the file has been processed and uploaded to S3.

I can handle the polling easily enough, but I am not sure where the appropriate location is to store the process state. It needs to be writable by the Pyro process and readable by my polling view.

  • I am hesitant to add columns to the database for data which should really only persist for 30 to 60 seconds.
  • I have considered using Django's low-level cache API and using a file id as the key, however, I don't believe this is really what the cache framework is designed for and I'm not sure what unforeseen problems there might be with going this route.
  • Lastly, I have considered storing state in the Pyro object doing the processing, but then it still seems like I would need to add a boolean "processing_complete" database column so that the view knows whether or not to query state from the Pyro object.
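For what the second option would look like in practice, here is a minimal runnable stand-in that mirrors the `cache.set`/`cache.get` calls you would make against Django's low-level cache API (a plain dict with expiry is used here only so the sketch runs outside a Django project; the key format and 300-second timeout are invented):

```python
import time

# Stand-in for Django's low-level cache API (cache.set / cache.get),
# so the idea is runnable without a configured Django project.
_store = {}

def set_processing_state(file_id, state, timeout=300):
    # In Django: cache.set("upload-state-%s" % file_id, state, timeout)
    # Timeout is kept a bit longer than the expected 30-60 s window.
    _store[file_id] = (state, time.time() + timeout)

def get_processing_state(file_id):
    # In Django: cache.get("upload-state-%s" % file_id)
    # Returns None when the entry expired or was evicted -- the main
    # caveat of cache-backed state: entries can vanish at any time.
    entry = _store.get(file_id)
    if entry is None or time.time() > entry[1]:
        return None
    return entry[0]
```

The unforeseen problem alluded to above is exactly the `None` case: a cache is allowed to drop entries whenever it likes, so the polling view must treat a missing key as "unknown", not "failed".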

Of course, there are also some data integrity concerns with decoupling state from the database (what happens if the server goes down and all this data is in memory?). I am eager to hear how more seasoned web application developers would handle this sort of stateful processing.

+6  A: 

We do this by having a "Request" table in the database.

When the upload arrives, we create the uploaded File object, and create a Request.

We start the background batch processor.

We return a 200 "we're working on it" page -- it shows the Requests and their status.

Our batch processor uses the Django ORM. When it finishes, it updates the Request object. We can (but don't) send an email notification. Mostly, we just update the status so that the user can log in again and see that processing has completed.
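The Request-table pattern described here is simple enough to sketch. This stand-in uses stdlib sqlite3 instead of the Django ORM so it runs anywhere; the column names and status values are invented for illustration, not taken from the answerer's schema:

```python
import sqlite3

# Stand-in for the "Request" table: the view creates a row, the batch
# processor updates it, and the polling view reads the status back.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE request (
    id INTEGER PRIMARY KEY,
    file_name TEXT,
    status TEXT DEFAULT 'pending')""")

def create_request(file_name):
    # View handler: record the upload, return the request id to poll on.
    cur = conn.execute("INSERT INTO request (file_name) VALUES (?)",
                       (file_name,))
    return cur.lastrowid

def mark_done(request_id):
    # Batch processor: update the row when processing finishes.
    conn.execute("UPDATE request SET status = 'done' WHERE id = ?",
                 (request_id,))

def get_status(request_id):
    # Polling view: read the current status.
    row = conn.execute("SELECT status FROM request WHERE id = ?",
                       (request_id,)).fetchone()
    return row[0] if row else None
```

With the Django ORM this is just a small `Request` model with a `status` field and the equivalent `create()` / `save()` / `get()` calls.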


Batch Server Architecture notes.

It's a WSGI server that waits on a port for a batch processing request. The request is a REST POST with an ID number; the batch processor looks this up in the database and processes it.
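A minimal version of such a batch server might look like the following. This is a sketch, not the answerer's code: the body format (`id=42`), the `process_request` placeholder, and the port are all assumptions.

```python
from wsgiref.simple_server import make_server

def process_request(request_id):
    # Placeholder: in a real server this would look the Request up in
    # the database (e.g. via the Django ORM) and process the file.
    pass

def batch_app(environ, start_response):
    # Accept a REST POST whose body carries the request id, e.g. "id=42".
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(size).decode("utf-8")
        request_id = body.split("=", 1)[1]
        process_request(request_id)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"queued"]
    start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
    return [b""]

# To run standalone:
# make_server("", 8001, batch_app).serve_forever()
```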

The server is started automagically by our REST interface. If it isn't running, we spawn it. This makes a user transaction appear slow, but, oh well. It's not supposed to crash.

Also, we have a simple crontab to check that it's running. At most, it will be down for 30 minutes between "are you alive?" checks. We don't have a formal startup script (we run under Apache with mod_wsgi), but we may create a "restart" script that touches the WSGI file and then does a POST to a URL that does a health-check (and starts the batch processor).

When the batch server starts, there may be unprocessed requests for which it has never gotten a POST. So, the default startup is to pull ALL work out of the Request queue -- assuming it may have missed something.
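The catch-up-on-startup behaviour reduces to one query. In Django ORM terms it would be something like `Request.objects.filter(status='pending')` (field names assumed); here it is shown over plain `(id, status)` tuples so it runs standalone:

```python
def pending_ids(requests):
    # Everything still marked pending may have missed its wake-up POST.
    return [rid for rid, status in requests if status == "pending"]

def catch_up(requests, process):
    # On startup, drain ALL unfinished work rather than waiting for
    # notifications that may never have arrived.
    for rid in pending_ids(requests):
        process(rid)
```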

S.Lott
After thinking about this overnight I have decided that you are absolutely right. It just doesn't make sense not to use the database. I have also decided that Pyro is a bad fit here and that I should just do what normal people do and use a cron job with a lock file.
bouvard
We don't use cron. We have our batch system as a little WSGI server and we make an HTTP request with urllib2 to wake it up. It gets the Request ID from the WSGI request and fetches the details with the ordinary Django ORM.
S.Lott
This is sort of what I planned to do with Pyro, but the problem I foresee is that a sudden server outage could leave documents half-processed with no new request message to re-initiate processing. If I use a cron job I know that I can just pick the 10 oldest unfinished jobs from the Request table, and I will pick up any that got cut off during the outage.
bouvard
I suppose I should have phrased that last comment as a question as clearly you have a way of dealing with this problem: what is your strategy?
bouvard
We don't like frequent crontab polling requests. Too much overhead in the database doing a SELECT every few minutes. The requests are relatively rare, so we use RESTful notification of a WSGI server.
S.Lott
Ah, I see. In my case it's just the opposite: the uploads are central to the application, and so the processing requests could become _very_ frequent. As much as I try to avoid cron (more for philosophical reasons than anything else), I think it is the sensible solution here.
bouvard
Cron is polling, polling is bad. Use crontab only to confirm that things are running. You want a proper queue with a proper server that is properly waiting on the queue with nothing else to do. A WSGI-based server, waiting for each request will cover 99.7% of your operations. The "crash-restart" is so rare that it should not dominate your design decisions.
S.Lott
S. Lott: Thanks for all your input in this. After a couple false starts I've finally conceded that you are on the money with your WSGI approach. However, I am curious about one implementation detail: how do you ping the batch server to start the request without waiting for the reply? Do you somehow do that from the calling application, or does the batch server generate a new process (or thread) and then return 200 immediately? I have not found an elegant way of dealing with this. It seems like it ought to be trivial for the main application to start the batch server asynchronously.
bouvard
Our batch server has a small client API package that our main web apps use. This client API package generally uses urllib2 to make a request. If it can't establish a connection, it uses subprocess to spawn the batch server and then makes the request. Then the web app sends back the 200 irrespective of what the batch server has or has not actually done.
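A rough sketch of such a client API, under stated assumptions: the original used urllib2 (Python 2), so `urllib.request` stands in here, and the URL, port, and spawn command line are all invented. The key point is the fallback: if the connection fails, spawn the server and retry once.

```python
import subprocess
import urllib.error
import urllib.request

# Hypothetical endpoint and command line -- adjust for your deployment.
BATCH_URL = "http://127.0.0.1:8157/"

def notify_batch_server(request_id, spawn=True):
    # Fire the wake-up POST; returns True if the batch server answered.
    data = ("id=%s" % request_id).encode("utf-8")
    try:
        urllib.request.urlopen(BATCH_URL, data, timeout=2)
        return True
    except urllib.error.URLError:
        if spawn:
            # Server isn't running: start it, then retry exactly once.
            subprocess.Popen(["python", "batch_server.py"])
            return notify_batch_server(request_id, spawn=False)
        return False
```

The calling web app ignores the return value when building its response: it sends back the 200 regardless, exactly as described above.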
S.Lott
+1  A: 

So, it's a job queue that you need. For your case, I would absolutely go with the DB to save state, even if those states are short lived. It sounds like that will meet all of your requirements, and isn't terribly difficult to implement since you already have all of the moving parts there, available to you. Keep it simple unless you need something more complex.

If you need something more powerful or more sophisticated, I'd look at something like Gearman.

brianz
+1  A: 

I know this is an old question but someone may find my answer useful even after all this time, so here goes.

You can of course use the database as a queue, but there are solutions developed exactly for that purpose.

AMQP was made for exactly this, used together with Celery or Carrot and a broker server like RabbitMQ or ZeroMQ.

That's what we are using in our latest project and it is working great.

For your problem, Celery and RabbitMQ seem like the best fit. RabbitMQ provides persistence for your messages, and Celery makes it easy to expose views for polling the status of processes run in parallel.
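The shape of the pattern can be sketched with the stdlib alone. Here `queue.Queue` stands in for the broker and a shared status map for the result backend; with Celery itself this is roughly `process_upload.delay(file_id)` in the upload view and checking the task's state in the polling view:

```python
import queue
import threading

# Broker stand-in carries job ids to the worker; the status map plays
# the role of Celery's result backend for the polling view.
jobs = queue.Queue()
status = {}

def enqueue(file_id):
    # Upload view: record the job and hand it to the worker.
    status[file_id] = "PENDING"
    jobs.put(file_id)

def worker():
    # Background worker: process jobs until a None sentinel arrives.
    while True:
        file_id = jobs.get()
        if file_id is None:
            break
        status[file_id] = "STARTED"
        # ... server-side processing and upload to S3 would go here ...
        status[file_id] = "SUCCESS"
        jobs.task_done()
```

The status values mirror Celery's task states (`PENDING`, `STARTED`, `SUCCESS`), which is what the polling view would report back to the client.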

You may also be interested in octopy.

Bartosz Ptaszynski