My question is: which Python framework should I use to build my server?

Notes:

  • This server talks HTTP with its clients: GET and POST (via pyAMF)
  • Clients "submit" tasks for processing and then, sometime later, retrieve the associated "task_result"
  • submit and retrieve might be separated by days - different HTTP connections
  • The "task" is a lump of XML describing a problem to be solved, and a "task_result" is a lump of XML describing an answer.
  • When a server gets a "task", it queues it for processing
  • The server manages this queue and, when a task reaches the top, arranges for it to be processed.
  • The processing is performed by a long-running (15 mins?) external program (via subprocess), which is fed the task XML and produces a "task_result" lump of XML that the server picks up and stores (for later client retrieval).
  • It serves a couple of basic HTML pages showing the queue and processing status (admin purposes only)

I've experimented with twisted.web, using SQLite as the database and threads to handle the long-running processes.

But I can't help feeling that I'm missing a simpler solution. Am I? If you were faced with this, what technology mix would you use?

A: 

It seems any Python web framework will suit your needs. I work with a similar system on a daily basis, and I can tell you that your solution with threads and SQLite for queue storage is about as simple as you're going to get.

Assuming order doesn't matter in your queue, threads should be acceptable. It's important to make sure you don't create race conditions with your queues or, for example, have two of the same job type running simultaneously. If that is a concern, I'd suggest a single-threaded application that works through the items in the queue one by one.
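
A minimal sketch of that single-worker approach, assuming Python 3, a SQLite table named tasks with status, task_xml and result_xml columns, and a solver program that reads XML on stdin and writes XML to stdout (all of these names are illustrative, not from the answer above):

    import sqlite3
    import subprocess
    import threading
    import time

    DB = "queue.db"  # illustrative path

    def worker():
        # Single worker thread: drains the queue one task at a time,
        # so no two jobs ever run simultaneously.
        while True:
            conn = sqlite3.connect(DB)
            row = conn.execute(
                "SELECT id, task_xml FROM tasks WHERE status = 'queued' "
                "ORDER BY id LIMIT 1").fetchone()
            if row is None:
                conn.close()
                time.sleep(5)  # nothing queued; poll again later
                continue
            task_id, task_xml = row
            conn.execute("UPDATE tasks SET status = 'running' WHERE id = ?", (task_id,))
            conn.commit()
            # Long-running external solver: task XML in on stdin, result XML out on stdout.
            proc = subprocess.Popen(["solver"], stdin=subprocess.PIPE,
                                    stdout=subprocess.PIPE)
            result_xml, _ = proc.communicate(task_xml.encode())
            conn.execute("UPDATE tasks SET status = 'done', result_xml = ? WHERE id = ?",
                         (result_xml.decode(), task_id))
            conn.commit()
            conn.close()

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()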

Joey Robert
A: 

I'd suggest the following. (Since it's what we're doing.)

A simple WSGI server (wsgiref or werkzeug). The HTTP requests coming in will naturally form a queue. No further queueing needed. You get a request, you spawn the subprocess as a child and wait for it to finish. A simple list of children is about all you need.

I used a modification of the main "serve forever" loop in wsgiref to periodically poll all of the children to see how they're doing.
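
That shape is roughly the following, assuming wsgiref's simple server and a solver invoked as "solver <task file> <result file>"; the handler, timeout value, and file names are illustrative, not from the answer above:

    import itertools
    import subprocess
    from wsgiref.simple_server import make_server

    children = []            # (task_id, Popen) pairs
    ids = itertools.count(1)

    def app(environ, start_response):
        if environ["REQUEST_METHOD"] == "POST":
            length = int(environ.get("CONTENT_LENGTH") or 0)
            task_xml = environ["wsgi.input"].read(length)
            task_id = str(next(ids))
            with open(task_id + ".task.xml", "wb") as f:
                f.write(task_xml)
            # Each request becomes its own stand-alone child process; no threads.
            proc = subprocess.Popen(["solver", task_id + ".task.xml",
                                     task_id + ".result.xml"])
            children.append((task_id, proc))
            start_response("202 Accepted", [("Content-Type", "text/plain")])
            return [task_id.encode()]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"see the queue page"]

    httpd = make_server("", 8000, app)
    httpd.timeout = 1.0                       # so handle_request() returns periodically
    while True:
        httpd.handle_request()                # serve (at most) one request
        for task_id, proc in list(children):  # then poll all the children
            if proc.poll() is not None:
                children.remove((task_id, proc))
                print(task_id, "finished with exit code", proc.returncode)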

A simple SQLite database can track request status. Even this may be overkill, because your XML inputs and results can just lie around in the file system.

That's it. Queueing and threads don't really enter into it. A single long-running external process is too complex to coordinate. It's simplest if each request is a separate, stand-alone child process.

If you get immense bursts of requests, you might want a simple governor to prevent creating thousands of children. The governor could be a simple queue, built using a list with append() and pop(). Every request goes in, but only requests that fit within some "max number of children" limit are taken out.
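
That governor can be as small as the following sketch, where MAX_CHILDREN and start_child are illustrative names, not from the answer above:

    MAX_CHILDREN = 4   # illustrative limit
    waiting = []       # requests that arrived while we were at the limit

    def admit(task, children):
        # Every request goes into the waiting list...
        waiting.append(task)
        # ...but tasks only come out while we are under the child limit.
        while waiting and len(children) < MAX_CHILDREN:
            start_child(waiting.pop(0), children)   # start_child is hypothetical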

S.Lott
"A single long-running external process is too complex to coordinate"?Coordinate with what? Actually, the idea of having an external worker is to prevent any need for coordination and being able to easily control parallelism. As you noted, spawning processes is indeed a problem when you expect request bursts, and then you really need some more coordination.I usually set up several workers on several machines, and supervise them using supervisord (http://supervisord.org/)
thesamet
How do you get work to and from this long-running process? It seems simpler to just fork work as a subprocess rather than engage in yet another IPC exercise to coordinate with these external workers.
S.Lott
You use a queue framework that manages these technical details for you (see my answer). I agree that the subprocesses approach is simpler and there's less infrastructure to worry about, and it's a good fit for certain applications. But for anything I run in production, I'd prefer something that offers more control over parallelism, and setting a fixed number of external workers does exactly that.
thesamet
+3  A: 

I'd recommend using an existing message queue. There are many to choose from (see below), and they vary in complexity and robustness.

Also, avoid threads: let your processing tasks run in a different process (why do they have to run in the webserver?)

By using an existing message queue, you only need to worry about producing messages (in your webserver) and consuming them (in your long running tasks). As your system grows you'll be able to scale up by just adding webservers and consumers, and worry less about your queuing infrastructure.
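
For example, with beanstalkd (one of the queues suggested in the comments below) and the beanstalkc client, the producer/consumer split looks roughly like this; a sketch assuming a beanstalkd daemon on localhost and a hypothetical run_solver helper:

    import beanstalkc

    # In the webserver: produce one message per submitted task.
    queue = beanstalkc.Connection(host="localhost", port=11300)
    task_xml = "<task>...</task>"             # the submitted lump of XML
    queue.put(task_xml)

    # In a separate consumer process: take tasks off the queue and run them.
    queue = beanstalkc.Connection(host="localhost", port=11300)
    while True:
        job = queue.reserve()                 # blocks until a task is available
        result_xml = run_solver(job.body)     # run_solver is hypothetical
        job.delete()                          # acknowledge the task once done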

Some popular Python implementations of message queues:

thesamet
Another one based on rabbitmq/amqp: http://pypi.python.org/pypi/carrot/0.3.3
Van Gale
You really are looking for a message queue. I've had good luck with beanstalkd and I've heard good things about http://www.rabbitmq.com/ and http://www.zeromq.org/ , and there's also always gearman: http://www.danga.com/gearman/ . Sounds like you're more interested in long running tasks than high frequency invocation, so probably just about any queue will work for you.
Parand
+1  A: 

My reaction is to suggest Twisted, but you've already looked at this. Still, I stick by my answer. Without knowing your personal pain-points, I can at least share some things that helped me reduce almost all of the deferred-madness that arises when you have several dependent, blocking actions you need to perform for a client.

Inline callbacks (lightly documented here: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.html) provide a means to make long chains of deferreds much more readable (to the point of looking like straight-line code). There is an excellent example of the complexity reduction this affords here: http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html
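
As a flavour of what inlineCallbacks buys you, a chain of dependent deferreds can read as straight-line code (a sketch; store_task, run_external_solver, and store_result are illustrative Deferred-returning helpers, not part of Twisted):

    from twisted.internet import defer

    @defer.inlineCallbacks
    def handle_task(task_xml):
        # Each yield waits on a Deferred without blocking the reactor.
        task_id = yield store_task(task_xml)
        result_xml = yield run_external_solver(task_id)
        yield store_result(task_id, result_xml)
        defer.returnValue(task_id)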

You don't always have to get your bulk processing to integrate nicely with Twisted. Sometimes it is easier to break a large piece of your program off into a stand-alone, easily testable/tweakable/implementable command line tool and have Twisted invoke this tool in another process. Twisted's ProcessProtocol provides a fairly flexible way of launching and interacting with external helper programs. Furthermore, if you suddenly decide you want to cloudify your application, it is not all that big of a deal to use a ProcessProtocol to simply run your bulk processing on a remote server (random EC2 instances perhaps) via ssh, assuming you have the keys set up already.
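
A minimal ProcessProtocol along those lines might look like the following Python 2-era sketch, assuming a solver that reads the task XML on stdin and writes the result XML to stdout; SolverProtocol, task_xml, and the store_result callback are illustrative names:

    from twisted.internet import protocol, reactor

    class SolverProtocol(protocol.ProcessProtocol):
        def __init__(self, task_xml, on_result):
            self.task_xml = task_xml
            self.on_result = on_result   # called with the result XML when done
            self.output = []

        def connectionMade(self):
            # Feed the task to the external program, then close its stdin.
            self.transport.write(self.task_xml)
            self.transport.closeStdin()

        def outReceived(self, data):
            self.output.append(data)

        def processEnded(self, reason):
            self.on_result("".join(self.output))

    reactor.spawnProcess(SolverProtocol(task_xml, store_result),
                         "solver", args=["solver"])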

rndmcnlly
You might use https://launchpad.net/ampoule to make life a bit easier on the spawning-processes side of things.
Glyph
A: 

You can have a look at Celery.
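
For reference, the submit-now/retrieve-later pattern in Celery is roughly the following (a sketch assuming a recent Celery, a running broker, a result backend, and a hypothetical run_solver helper):

    from celery import Celery

    app = Celery("tasks", broker="amqp://localhost//", backend="rpc://")

    @app.task
    def solve(task_xml):
        # Run the long-running external solver and return its result XML.
        return run_solver(task_xml)   # run_solver is hypothetical

    # In the webserver: submit now, hand the id back to the client.
    task_id = solve.delay(task_xml).id
    # ...days later, in another request: look the result up by id.
    result_xml = app.AsyncResult(task_id).get()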

Kishore A