views:

75

answers:

4

I have a Python script that downloads a web page, parses it and returns some value from the page. I need to scrape a few such pages to get the final result. Each page retrieval takes a long time (5-10 s), and I'd prefer to make the requests in parallel to decrease the wait time.
The question is: which mechanism will do this quickly, correctly and with minimal CPU/memory waste? Twisted, asyncore, threading, something else? Could you provide some links with examples?
Thanks

UPD: There are a few solutions to the problem; I'm looking for a compromise between speed and resources. If you could share some details from experience (how fast it is under load in your view, etc.), that would be very helpful.

+3  A: 

multiprocessing

Spawn a bunch of processes, one for each URL you want to download. Use a Queue to hold a list of URLs, and make the processes each read a URL off the queue, process it, and return a value.
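A rough sketch of that approach (Python 2; the URLs are placeholders and len(page) stands in for whatever parsing you actually do):

import multiprocessing
import urllib2

def worker(task_queue, result_queue):
    # each worker reads URLs off the queue until it sees the None sentinel
    for url in iter(task_queue.get, None):
        page = urllib2.urlopen(url).read()
        result_queue.put((url, len(page)))  # replace len(page) with your parsed value

if __name__ == '__main__':
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    urls = ['http://example.com/a', 'http://example.com/b']  # your URLs here

    procs = [multiprocessing.Process(target=worker, args=(task_queue, result_queue))
             for _ in urls]
    for p in procs:
        p.start()
    for url in urls:
        task_queue.put(url)
    for p in procs:
        task_queue.put(None)  # one sentinel per process so every worker exits
    results = [result_queue.get() for _ in urls]
    for p in procs:
        p.join()
    print results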

katrielalex
+3  A: 

multiprocessing.Pool can be a good fit, and there are some useful examples. For example, if you have a list of URLs, you can map the content retrieval over the pool so the downloads run concurrently:

import multiprocessing
import urllib2

def process_url(url):
    # download and parse the page, then return the value you need
    page = urllib2.urlopen(url).read()
    return page  # replace with your parsing result

pool = multiprocessing.Pool(processes=4) # how much parallelism?
results = pool.map(process_url, list_of_urls)
pygabriel
+1  A: 

See my answer here http://stackoverflow.com/questions/3491455

gnibbler
Hey, Gnibbler added a profile pic!
twneale
A: 

Use an asynchronous (i.e. event-driven rather than blocking) networking framework for this. One option is Twisted. Another option that has recently become available is monocle, a mini-framework that hides the complexities of non-blocking operations. See this example. It can use Twisted or Tornado behind the scenes, but you don't really notice much of it.
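For instance, a minimal Twisted sketch (list_of_urls and the parse step are placeholders; getPage is Twisted's simple HTTP client from twisted.web.client):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

def parse(html):
    # extract whatever value you need from the page
    return len(html)

def fetch_all(urls):
    # fire all requests at once; gatherResults waits for every Deferred
    return defer.gatherResults([getPage(url).addCallback(parse) for url in urls])

def done(results):
    print results
    reactor.stop()

list_of_urls = ['http://example.com/a', 'http://example.com/b']
fetch_all(list_of_urls).addCallbacks(done, lambda failure: reactor.stop())
reactor.run()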

loevborg