views: 185
answers: 4

I was sure there was something like this in the standard library, but it seems I was wrong.

I have a bunch of urls that I want to urlopen in parallel. I want something like the builtin map function, except the work is done in parallel by a bunch of threads.

Is there a good module that does this?

+1  A: 

The Python module Queue might help you. Use one thread that uses Queue.put() to push all urls into the queue and the worker threads simply get() the urls one by one.

Python Docs: queue — A synchronized queue class
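
A minimal sketch of that pattern (my own illustration, not from the answer), assuming Python 2 module names (Queue, urllib2, threading); the URL list and the worker count of 4 are just placeholders:

import threading
import urllib2
import Queue

def worker(q, results):
    # Worker threads simply get() urls one by one until they see the sentinel.
    while True:
        url = q.get()
        if url is None:
            q.task_done()
            break
        try:
            results.append((url, urllib2.urlopen(url).read()))
        except Exception, exc:
            results.append((url, exc))
        finally:
            q.task_done()

urls = ['http://www.example.com/', 'http://www.python.org/']  # placeholder urls
q = Queue.Queue()
results = []

threads = [threading.Thread(target=worker, args=(q, results)) for _ in range(4)]
for t in threads:
    t.start()

# Producer: put all urls into the queue, then one sentinel per worker so each thread exits.
for url in urls:
    q.put(url)
for _ in threads:
    q.put(None)

q.join()  # block until every queued item has been handled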

cypheon
A: 

I'd wrap it up in a function (untested):

import itertools
import threading
import urllib2
import Queue

def openurl(url, queue):
    def starter():
        try:
            result = urllib2.urlopen(url)
        except Exception, exc:
            # Hand the exception to the consumer as a callable that re-raises it.
            def raiser():
                raise exc
            queue.put((url, raiser))
        else:
            queue.put((url, lambda: result))
    threading.Thread(target=starter).start()

myurls = ... # the list of urls
myqueue = Queue.Queue()

map(openurl, myurls, itertools.repeat(myqueue))

for each in myurls:
    url, getresult = myqueue.get()
    try:
        result = getresult()
    except Exception, exc:
        print 'exception raised: ' + str(exc)
    else:
        pass  # do stuff with result
pillmuncher
+4  A: 

There is a map method in multiprocessing.Pool, but that uses multiple processes.

And if multiple processes aren't your dish, you can use multiprocessing.dummy, which offers the same API backed by threads.

import urllib
import multiprocessing.dummy

p = multiprocessing.dummy.Pool(5)
def f(post):
    return urllib.urlopen('http://stackoverflow.com/questions/%u' % post)

print p.map(f, range(3329361, 3329361 + 5))
Scott Robinson
A: 

Someone recommended I use the futures package for this. I tried it and it seems to be working.

http://pypi.python.org/pypi/futures

Here's an example:

"Download many URLs in parallel."

import functools
import urllib.request
import futures

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib.request.urlopen(url, timeout=timeout).read()

with futures.ThreadPoolExecutor(50) as executor:
    future_list = executor.run_to_futures(
        [functools.partial(load_url, url, 30) for url in URLS])
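
For what it's worth, the same package later landed in the standard library as concurrent.futures (Python 3.2+), and ThreadPoolExecutor has a map method that behaves like the builtin map. A minimal sketch, reusing load_url and URLS from the example above:

import concurrent.futures
import functools

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map works like the builtin map, but calls load_url in worker threads.
    for page in executor.map(functools.partial(load_url, timeout=30), URLS):
        print(len(page))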
cool-RR