views:

3559

answers:

12

A (long) while ago I wrote a web-spider that I multithreaded to enable concurrent requests to occur at the same time. That was in my Python youth, in the days before I knew about the GIL and the associated woes it creates for multithreaded code (IE, most of the time, its not really multithreaded!)...

I'd like to rework this code to make it more robust and perform better. There are basically two ways I could do this: I could use the new multiprocessing module in 2.6+ or I could go for a reactor / event-based model of some sort. I would rather do the later since it's far simpler and less error-prone.

So the question relates to what framework would be best suited to my needs. The following is a list of the options I know about so far:

  • Twisted: The granddaddy of Python reactor frameworks: seems complex and a bit bloated however. Sleep learning curve for a small task.
  • Eventlet: From the guys at lindenlab. Greenlet based framework that's geared towards these kinds of tasks. I had a look at the code though and it's not too pretty: non-pep8 compliant, scattered with prints (why do people do this in a framework!?), API seems a little inconsistent.
  • PyEv: Immature, doesn't seem to be anyone using it right now though it is based on libevent so it's got a solid backend.
  • asyncore: From the stdlib: über low-level, seems like a lot of legwork involved just to get something off the ground.
  • tornado: Though this is a server oriented product designed to server dynamic websites it does feature an async HTTP client and a simple ioloop. Looks like it could get the job done but not what it was intended for. [edit: doesn't run on Windows unfortunately, which counts it out for me - its a requirement for me to support this lame platform]
  • scrapy: A twisted-based framework designed specifically for writing web spiders. Looks really nice in principle except it relies on libxml2 which is not easy to install or distribute on OS X (my preferred host platform) and additionally is not really factored to work like as a library for other host applications yet.

Is there anything I have missed at all? Surely there must be a library out there that fits the sweet-spot of a simplified async networking library!

[edit: big thanks to intgr for his pointer to this page. If you scroll to the bottom you will see there is a really nice list of projects that aim to tackle this task in one way or another. It seems actually that things have indeed moved on since the inception of Twisted: people now seem to favour a co-routine based solution rather than a traditional reactor / callback oriented one. The benefits of this approach are clearer more direct code: I've certainly found in the past, especially when working with boost.asio in C++ that callback based code can lead to designs that can be hard-to-follow and are relatively obscure to the untrained eye. Using co-routines allows you to write code that looks a little more synchronous at least. I guess now my task is to work out which one of these many libraries I like the look of and give it a go! Glad I asked now...]

[edit: perhaps of interest to anyone who followed or stumbled on this this question or cares about this topic in any sense: I found a really great writeup of the current state of the available tools for this job]

+7  A: 

I liked the concurrence Python module which relies on either Stackless Python microthreads or Greenlets for light-weight threading. All blocking network I/O is transparently made asynchronous through a single libevent loop, so it should be nearly as efficient as an real asynchronous server.

I suppose it's similar to Eventlet in this way.

The downside is that its API is quite different from Python's sockets/threading modules; you need to rewrite a fair bit of your application (or write a compatibility shim layer)

Edit: It seems that there's also cogen, which is similar, but uses Python 2.5's enhanced generators for its coroutines, instead of Greenlets. This makes it more portable than concurrence and other alternatives. Network I/O is done directly with epoll/kqueue/iocp.

intgr
@intgr: great links. I had seen both of those before once upon-a-time, those are the sorts of things I was hoping to see flushed out. +1
jkp
+13  A: 

Twisted is complex, you're right about that. Twisted is not bloated.

If you take a look here: http://twistedmatrix.com/trac/browser/trunk/twisted you'll find an organized, comprehensive, and very well tested suite of many protocols of the internet, as well as helper code to write and deploy very sophisticated network applications. I wouldn't confuse bloat with comprehensiveness.

It's well known that the Twisted documentation isn't the most user-friendly from first glance, and I believe this turns away an unfortunate number of people. But Twisted is amazing (IMHO) if you put in the time. I did and it proved to be worth it, and I'd recommend to others to try the same.

clemesha
@clemesha: maybe you are right, and it's not bloadted, but it does feel like there is rather a bit too much to get my head around to do something simple. I understand async programming, I've worked in C++ with boost::asio so the concepts are not new, but its all the gumph that compes with doing twisted stuff: it's a whole new world, much like django is for web stuff. Again when i'm doing web stuff I work with lightweight WSGI code and plug together only what i need. Horses for courses I guess.
jkp
@clemesha: erm, I took the plunge today to have a look: twisted weighs in at 20MB! Even the core is 12MB....if that isn't bloated, I'm not sure quite what is.
jkp
The basic Twisted APIs are pretty small (reactor, deferred, protocol). Most of the Twisted code is async protocol implementations using those basics."Bloat" is not a useful adjective here (or indeed in most cases). Twisted's size is reasonable for the amount of stuff it does.
daf
+2  A: 

None of these solutions will avoid that fact that the GIL prevents CPU parallelism - they are just better ways of getting IO parallelism that you already have with threads. If you think you can do better IO, by all means pursue one of these, but if your bottleneck is in processing the results nothing here will help except for the multiprocessing module.

What's wrong with using multiple processes?
Emil Ivanov
Nothing at all, hence the suggestion to use the multiprocessing module.
A: 

Anyone thinking about high performance servers should watch the Node.js video from jsconf.eu. It has a wonderful description of the problems in current servers and how we think about the problem. Indeed, with the problem of how IO is taught.

The movie: Ryan Dahl at jsconf.eu

The site: nodejs.org

Willison's weblog about Node.js

This is nothing short than a rethinking of high performance servers. Designed for IRC, HTTP, long-polling Comet, etc. No blocking calls. None.

(The movie is a classic. The hand-drawn "NIKE" shirt really puts it over the top. As Tim Bray said of the video, "Um, wow.") Dahl is now one of my programming heroes.

Nosredna
-1 for advertisement; *"most revolutionary server ever*"* belongs to a sales brochure, not an informed discussion. And what does this have to do with Python?
intgr
It's not an advertisement! I have nothing to do with it. But it's the coolest thing I've seen in years. What it has to do with _Twisted_ is that the video has the best statement of the problems of Twisted that I've seen.
Nosredna
Updated for @intgr to somewhat mask my obvious excitement and enthusiasm.
Nosredna
This is off topic even though it is interesting.
John Bellone
I don't see how it's off topic. Node.js is designed as an alternative to Twisted (due to the problems of Twisted, as spelled out by the OP and others), which is what the question is asking for.
Nosredna
The question was clearly about Python and Python frameworks.
John Bellone
+5  A: 

Kamaelia hasn't been mentioned yet. Its concurrency model is based on wiring together components with message passing between inboxes and outboxes. Here's a brief overview.

lost-theory
I used kamaelia for an app - it was extremely painful. IMHO there are other, better options for concurrenct in python (most of which are mentioned above)
Ben Ford
+5  A: 

I wouldn't go as far as to call Twisted bloated, but it is difficult to wrap your head around. I avoided really settling in an learn for quite a while as I always wanted something a little easier for 'small tasks'.

However, now that I have worked with it some more I have to say having all the batteries included is VERY nice.

All the other async libraries I've worked with end being way less mature than they even appear. Twisted's event loop is solid.

I'm not quite sure how to solve the steep Twisted learning curve. It might help if someone would fork it and clean a few things up, like removing all the backwards compatability cruft and the dead projects. But that's the nature of mature software I guess.

rhettg
+6  A: 

gevent is eventlet cleaned up.

API-wise it follows the same conventions as the standard library (in particular, threading and multiprocessing modules) where it makes sense. So you have familiar things like Queue and Event to work with.

It only supports libevent as reactor implementation but takes full advantage of it, featuring a fast WSGI server based on libevent-http and resolving DNS queries through libevent-dns as opposed to using a thread pool like most other libraries do.

Like eventlet, it makes the callbacks and Deferreds unnecessary by using greenlets.

Check out the examples: concurrent download of multiple urls, long polling webchat.

Denis Bilenko
+2  A: 

A really interesting comparison of such frameworks was compiled by Nicholas Piël on his blog: it's well worth a read!

jkp
While I agree that article was an interesting read, I think it's worthwhile to consider the validity of the presented benchmarks. See the comments here: http://www.reddit.com/r/programming/comments/ahepg/benchmark_of_asynchronous_servers_in_python/
clemesha
@clemesha, while the point in that reddit page is worth noting, the benchmark was done on a dual core machine and likely was not suffering from fatal flaw described. I suppose it's *possible* both the client and server ran on the same core, but it doesn't seem likely.
Peter Hansen
+1  A: 

There is a good book on the subject: "Twisted Network Programming Essentials", by Abe Fettig. The examples show how to write very Pythonic code, and to me personally, do not strike me as based on a bloated framework. Look at the solutions in the book, if they aren't clean, then I don't know what clean means.

My only enigma is the same I have with other frameworks, like Ruby. I worry, does it scale up? I would hate to commit a client to a framework that is going to have scalability problems.

mrsmoothie
A: 

Also try Syncless. It's coroutine-based (so it's similar to Concurrence, Eventlet and gevent). It implements drop-in non-blocking replacements for socket.socket, socket.gethostbyname (etc.), ssl.SSLSocket, time.sleep and select.select. It's fast. It needs Stackless Python and libevent. It contains a mandatory Python extension written in C (Pyrex/Cython).

pts
+1  A: 

@rhettg (sorry I dont't have enough reputation to comment on it)

I've started to use twisted for some things. The beauty of it almost is because it's "bloated." There are connectors for just about any of the main protocols out there. You can have a jabber bot that will take commands and post to an irc server, email them to someone, run a command, read from an NNTP server, and monitor a web page for changes. The bad news is it can do all of that and can make things overly complex for simple tasks like the OP explained. The advantage of python though is you only include what you need. So while the download may be 20mb, you may only include 2mb of libraries (which is still a lot). My biggest complaint with twisted is although they include examples, anything beyond a basic tcp server you're on your own.

While not a python solution, I've seen node.js gain a lot more traction as of late. In fact I've considered looking into it for smaller projects but I just cringe when I hear javascript :)

vrillusions
+1  A: 

Whizzer is a tiny asynchronous socket framework that uses pyev. Its very fast, primarily because of pyev. It attempts to provide a similiar interface as twisted with some slight changes.

bfrog