I'm looking to write a daemon that does the following (a rough sketch of the loop I have in mind follows the list):

  • reads a message from a queue (SQS, RabbitMQ, whatever ...) containing a path to a zip file
  • updates a record in the database saying something like "this job is processing"
  • reads the archive's contents and, for each file found, inserts a row into the database with information culled from the file's metadata
  • copies each file to S3
  • deletes the zip file
  • marks the job as "complete"
  • reads the next message in the queue and repeats
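
Here is a rough, untested sketch of that loop. boto3 (for SQS and S3) and sqlite3 are just stand-ins for whichever queue, storage, and database I end up with, and the queue URL, bucket, and table names are placeholders:

```python
import json
import os
import sqlite3
import zipfile

import boto3  # placeholder choice of queue/storage client

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads"  # placeholder
BUCKET = "my-upload-bucket"  # placeholder

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
db = sqlite3.connect("jobs.db")

def process_zip(job_id, zip_path):
    # mark the job as in progress
    db.execute("UPDATE jobs SET status = 'processing' WHERE id = ?", (job_id,))
    db.commit()
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # one row per file, using whatever metadata the archive exposes
            db.execute(
                "INSERT INTO files (job_id, name, size) VALUES (?, ?, ?)",
                (job_id, info.filename, info.file_size),
            )
            # stream the file's bytes straight from the archive into S3
            with archive.open(info) as fh:
                s3.upload_fileobj(fh, BUCKET, f"{job_id}/{info.filename}")
    db.execute("UPDATE jobs SET status = 'complete' WHERE id = ?", (job_id,))
    db.commit()
    os.remove(zip_path)

while True:
    # long-poll for the next message, e.g. {"job_id": 42, "path": "/tmp/upload.zip"}
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        process_zip(body["job_id"], body["path"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```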

This should run as a service, initiated by a message queued when someone uploads a file via the web frontend. The uploader doesn't need to see the results immediately, but the upload should be processed in the background fairly expediently.

I'm fluent in Python, so the first thing that comes to mind is writing a simple server with Twisted to handle each request and carry out the process above. But I've never written anything like this that would run in a multi-user context. It's not going to service hundreds of uploads per minute or hour, but it would be nice if it could reasonably handle several at a time. I'm also not terribly familiar with writing multi-threaded applications and dealing with issues like blocking.

How have people solved this in the past? What are some other approaches I could take?

Thanks in advance for any help and discussion!

+1  A: 

I've used Beanstalkd as a queueing daemon to very good effect (some near-real-time processing and image resizing, over 2 million jobs in the last few weeks). Throw a message into the queue with the zip filename (maybe from a specific directory); I serialise a command and parameters as JSON. When you reserve the message in your worker client, no one else can get it unless you allow it to time out, at which point it goes back onto the queue to be picked up again.

The rest is the unzipping and uploading to S3, for which there are other libraries.

If you want to handle several zip files at once, run as many worker processes as you want.
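
A minimal sketch of that pattern with the beanstalkc client; the JSON shape and the handle() function are placeholders, and the producer and worker would normally live in separate processes:

```python
import json
import beanstalkc  # classic Python client for Beanstalkd

beanstalk = beanstalkc.Connection(host="localhost", port=11300)

# Producer side: the web frontend drops a job onto the queue
beanstalk.put(json.dumps({"command": "process_zip", "path": "/uploads/batch-001.zip"}))

def handle(payload):
    # placeholder: unzip payload["path"], push files to S3, update the DB
    print("processing", payload["path"])

# Worker side: reserve() blocks until a job is available; while it is reserved,
# no other worker can pick it up unless it times out back onto the queue
while True:
    job = beanstalk.reserve()
    payload = json.loads(job.body)
    try:
        handle(payload)
        job.delete()   # done for good
    except Exception:
        job.bury()     # park the job for inspection instead of retrying forever
```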

Alister Bulman
+1  A: 

I would avoid doing anything multi-threaded and instead use the queue and the database to synchronize as many worker processes as you care to start up.

For this application I think Twisted or any framework for creating server applications is going to be overkill.

Keep it simple. A Python script starts up, checks the queue, does some work, and checks the queue again. If you want a proper background daemon, just make sure you detach from the terminal as described here: http://stackoverflow.com/questions/473620/how-do-you-create-a-daemon-in-python
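
For reference, the double-fork approach from that link boils down to roughly this (a sketch, not production-hardened):

```python
import os
import sys

def daemonize():
    # First fork: return control to the shell that started us
    if os.fork() > 0:
        sys.exit(0)
    os.setsid()  # new session, no controlling terminal
    # Second fork: guarantee we can never reacquire a controlling terminal
    if os.fork() > 0:
        sys.exit(0)
    # Point stdin/stdout/stderr at /dev/null
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
```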

Add some logging, and maybe a try/except block to email failures out to you.
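
For example, the standard library's logging.handlers.SMTPHandler will mail you tracebacks without any extra dependencies (mail host, addresses, and the stand-in worker function below are placeholders):

```python
import logging
import logging.handlers

logger = logging.getLogger("zip_worker")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

# Email anything at ERROR level or above
mailer = logging.handlers.SMTPHandler(
    mailhost="localhost",
    fromaddr="worker@example.com",
    toaddrs=["you@example.com"],
    subject="zip worker failure",
)
mailer.setLevel(logging.ERROR)
logger.addHandler(mailer)

def process_next_job():
    # stand-in for the real work (pull from queue, unzip, upload, ...)
    raise RuntimeError("something went wrong")

try:
    process_next_job()
except Exception:
    logger.exception("job failed")  # logs the traceback and triggers the email
```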

rhettg
This would be running as a service that is triggered when someone uploads a file. They don't need to see the results immediately, but the upload should trigger this process in the background. I'm amending the question to reflect this.
Carson
+1  A: 

I opted to use a combination of Celery (http://ask.github.com/celery/introduction.html), RabbitMQ, and a simple Django view to handle uploads. The workflow looks like this:

  1. The Django view accepts and stores the upload.
  2. A Celery Task is dispatched to process the upload; all work is done inside the Task (see the sketch below).
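
A rough sketch of that shape, using current Celery decorator syntax rather than the older Task-subclass style; all names, including the helper functions in the view, are illustrative:

```python
# tasks.py -- names are placeholders
from celery import shared_task

@shared_task
def process_upload(upload_id, zip_path):
    # unzip, record per-file metadata, push files to S3, mark the job complete
    ...

# views.py
from django.http import HttpResponse

def upload(request):
    f = request.FILES["archive"]
    path = save_to_disk(f)                 # placeholder helper
    upload_id = create_job_record(path)    # placeholder helper
    process_upload.delay(upload_id, path)  # hand off to the worker, return immediately
    return HttpResponse("queued", status=202)
```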
Carson