views: 64
answers: 1

Hi there,

I'm currently retrieving and parsing pages from a website using urllib2. However, there are many of them (more than 1000), and processing them sequentially is painfully slow.

I was hoping there was a way to retrieve and parse pages in parallel. If that's a good idea, is it possible, and how do I do it?

Also, what are "reasonable" values for the number of pages to process in parallel (I wouldn't want to put too much strain on the server or get banned because I'm using too many connections)?

Thanks!

+3  A: 

You can always use threads (i.e. run each download in a separate thread). For a large number of downloads this can become too resource-hungry, in which case I recommend you take a look at gevent and specifically this example, which may be just what you need.
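
For illustration, here is a minimal thread-pool sketch along those lines (Python 2 / urllib2, as in the question). The URL list and parse() are placeholders for your own code, and POOL_SIZE is just a guessed starting value for how many connections to keep open at once:

import urllib2
import threading
import Queue

POOL_SIZE = 8   # placeholder: number of simultaneous connections; keep it modest
urls = ["http://example.com/page%d" % i for i in range(1000)]  # placeholder list

task_queue = Queue.Queue()
for url in urls:
    task_queue.put(url)

def parse(html):
    # placeholder for your existing parsing code
    pass

def worker():
    while True:
        try:
            url = task_queue.get_nowait()
        except Queue.Empty:
            return  # no work left, thread exits
        try:
            html = urllib2.urlopen(url, timeout=30).read()
            parse(html)
        except urllib2.URLError:
            pass  # log or retry as appropriate
        finally:
            task_queue.task_done()

threads = [threading.Thread(target=worker) for _ in range(POOL_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()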

(from gevent.org: "gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of libevent event loop")
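
And a rough equivalent using gevent (a sketch assuming gevent is installed; the URL list, pool size and fetch() body are again placeholders). monkey.patch_all() makes urllib2 cooperative, and the Pool caps how many requests run at the same time:

from gevent import monkey
monkey.patch_all()

import urllib2
from gevent.pool import Pool

POOL_SIZE = 10  # placeholder: tune so you don't hammer the server
urls = ["http://example.com/page%d" % i for i in range(1000)]  # placeholder list

def fetch(url):
    try:
        html = urllib2.urlopen(url, timeout=30).read()
        # your parsing code would go here
    except urllib2.URLError:
        pass  # log or retry as appropriate

pool = Pool(POOL_SIZE)
for url in urls:
    pool.spawn(fetch, url)
pool.join()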

adamk
That looks good, I'll check it out. Thanks!
Anthony Labarre