Hi.

Got a situation where I'm going to be parsing websites. Each site has to have its own "parser" and possibly its own way of dealing with cookies, etc.

I'm trying to get in my head which would be a better choice.

Choice I: I can create a multiprocessing function, where the master ("masterspawn") app gets an input URL and in turn spawns a process/function within the masterspawn app that then handles all the setup/fetching/parsing of the page/URL.

This approach would have one master app running, which in turn creates multiple instances of the internal function. Should be fast, yes/no?
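A minimal sketch of what I mean by Choice I, with a per-site parser picked by hostname. The URLs, hostnames, and parser functions here are all made-up placeholders, and the "fetch" is faked so the sketch stands alone:

```python
from multiprocessing import Process, Queue
from urllib.parse import urlparse

def parse_site_a(html):
    # placeholder site-specific parser
    return "a:" + html[:10]

def parse_site_b(html):
    return "b:" + html[:10]

# each target site gets its own parser
PARSERS = {
    "a.example.com": parse_site_a,
    "b.example.com": parse_site_b,
}

def worker(url, results):
    # a real worker would fetch the page here (with per-site cookie
    # handling); the body is faked to keep the sketch self-contained
    html = "<html>%s</html>" % url
    host = urlparse(url).netloc
    results.put((url, PARSERS[host](html)))

if __name__ == "__main__":
    urls = ["http://a.example.com/x", "http://b.example.com/y"]
    results = Queue()
    procs = [Process(target=worker, args=(u, results)) for u in urls]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(sorted(results.get() for _ in urls))
```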

Choice II: I could create a "Twisted" kind of server that would essentially do the same thing as Choice I. The difference is that using "Twisted" would also impose some overhead. I'm trying to evaluate Twisted with regard to its being a "server", but I don't need it to perform the fetching of the URL.

Choice III: I could use Scrapy. I'm inclined not to go this route, as I don't want/need the overhead that Scrapy appears to have. As I stated, each of the targeted URLs needs its own parse function, as well as its own cookie handling...

My goal is basically to have the "architected" solution spread across multiple boxes, where each client box interfaces with a master server that allocates the URLs to be parsed.

Thanks for any comments on this.

-tom

+2  A: 

There are two dimensions to this question: concurrency and distribution.

Concurrency: either Twisted or multiprocessing will do the job of concurrently handling fetching/parsing jobs. I'm not sure, though, where your premise of "Twisted overhead" comes from. On the contrary, the multiprocessing path would incur much more overhead, since a (relatively heavyweight) OS process has to be spawned for each job. Twisted's way of handling concurrency is much more lightweight.

Distribution: multiprocessing won't distribute your fetch/parse jobs to different boxes. Twisted can, e.g. using its AMP protocol-building facilities.

I cannot comment on Scrapy, never having used it.

Peter Sabaini
+1  A: 

For this particular problem I'd go with multiprocessing: it's simple to use and simple to understand. You don't particularly need Twisted, so why take on the extra complication?
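To show how little code the simple route takes, here's a `multiprocessing.Pool` sketch; `fetch_and_parse` is a stand-in for your real per-site fetch/parse logic:

```python
from multiprocessing import Pool

def fetch_and_parse(url):
    # stand-in: a real version would download the page and run the
    # site-specific parser; here we just tag the URL
    return ("parsed", url)

if __name__ == "__main__":
    urls = ["http://example.com/%d" % i for i in range(4)]
    # the pool farms each URL out to a worker process
    with Pool(processes=4) as pool:
        results = pool.map(fetch_and_parse, urls)
    print(results)
```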

One other option you might want to consider: use a message queue. Have the master drop URLs onto a queue (e.g. beanstalkd, resque, 0MQ) and have worker processes pick up the URLs and process them. You'll get both concurrency and distribution: you can run workers on as many machines as you want.
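The pattern can be sketched with the stdlib `queue` module and threads standing in for the real pieces; in production the queue would be beanstalkd/resque/0MQ and the workers would be separate processes on separate machines:

```python
import queue
import threading

def worker(jobs, results):
    # each worker pulls URLs off the queue until it sees the sentinel
    while True:
        url = jobs.get()
        if url is None:  # sentinel: no more work
            break
        # a real worker would fetch and parse the page here
        results.append(("done", url))
        jobs.task_done()

jobs = queue.Queue()
results = []
threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(3)]
for t in threads:
    t.start()

# the master just drops URLs onto the queue
for u in ["http://example.com/a", "http://example.com/b"]:
    jobs.put(u)
# one sentinel per worker to shut them down
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
```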

Parand