I'm writing a simple site spider and I've decided to take this opportunity to learn something new in concurrent programming in Python. Instead of using threads and a queue, I decided to try something else, but I don't know what would suit me.

I have heard about Stackless, Celery, Twisted, Tornado, and others. I'd rather not set up a database and all the other dependencies Celery needs, but I would if it's a good fit for my purpose.

My question is: What is a good balance between suitability for my app and usefulness in general? I have taken a look at the tasklets in Stackless, but I'm not sure whether the urlopen() call will block or whether the tasklets will execute in parallel. I haven't seen either point mentioned anywhere.

Can someone give me a few details on my options and what would be best to use?

Thanks.

+2  A: 

I must say that Twisted gets my vote.

Performing event-driven tasks is fairly straightforward in Twisted. Integration with other important system components such as GTK+ and DBus is very easy.

The HTTP client support is basic for now but improving (>9.0.0): see related question.

The added bonus is that Twisted is available in the Ubuntu default repository ;-)

jldupont
Hmm, I saw that and it looks very interesting. Apparently I can fire off N requests and have each callback fire one more, which would (hopefully) keep the number of in-flight requests constant. Thanks for this!
Stavros Korokithakis
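
For what it's worth, the "fire off N requests and have each completion fire one more" idea from the comment above can be sketched without any network access. This is only an illustration under loud assumptions: it uses the standard library's asyncio as a stand-in event loop rather than Twisted's own API, and `FAKE_SITE`, `fetch_links`, and `crawl` are hypothetical names made up for the example. A real spider would replace `fetch_links` with an actual non-blocking HTTP request plus link extraction.

```python
import asyncio

# Hypothetical link graph standing in for real pages (an assumption for
# this sketch); a real spider would download and parse each URL instead.
FAKE_SITE = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a1", "/a2"],
    "/b": ["/", "/b1"],
    "/c": [],
    "/a1": [],
    "/a2": ["/a"],
    "/b1": [],
}


async def fetch_links(url):
    """Pretend to download `url` and return the links found on it."""
    await asyncio.sleep(0.01)  # simulated latency; yields to the event loop
    return FAKE_SITE.get(url, [])


async def crawl(start, max_in_flight=3):
    """Visit every page reachable from `start`, capping concurrent fetches."""
    seen = {start}        # URLs already scheduled, so nothing is fetched twice
    pending = [start]     # URLs discovered but not yet requested
    in_flight = set()     # tasks for requests currently running
    peak = 0              # highest number of simultaneous requests observed

    while pending or in_flight:
        # Top up: start new requests until the cap is reached.
        while pending and len(in_flight) < max_in_flight:
            in_flight.add(asyncio.create_task(fetch_links(pending.pop())))
        peak = max(peak, len(in_flight))

        # Wait until at least one request finishes, then queue its new links.
        done, in_flight = await asyncio.wait(
            in_flight, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            for link in task.result():
                if link not in seen:
                    seen.add(link)
                    pending.append(link)

    return seen, peak


if __name__ == "__main__":
    visited, peak = asyncio.run(crawl("/", max_in_flight=2))
    print(sorted(visited), peak)
```

The same shape should carry over to a callback-based framework: each finished request hands its newly found links to the scheduler, which starts replacements only while the in-flight count is below the cap.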
+4  A: 

Tornado is a web server, so it wouldn't help you much in writing a spider. Twisted is much more general (and, inevitably, complex), good for all kinds of networking tasks (and with good integration with the event loop of several GUI frameworks). Indeed, there used to be a twisted.web.spider (but it was removed years ago, since it was unmaintained -- so you'll have to roll your own on top of the facilities Twisted does provide).

Alex Martelli
+1  A: 

For a quick look at package sizes, see ohloh.net/p/compare.
Of course, source size is only a rough metric (what I'd really like is the number of pages of documentation, the number of pages of examples, and the dependencies), but it can help.

Denis