ansaurus

Question

Answer 1

A:

Use a thread pool with a shared list of urls. Each thread tries to pop a url from the list and download it until none are left. pop() from a list is threadsafe

while True:
    try:
        url = url_list.pop()
        # download URL here
    except IndexError:
        break

gnibbler 2010-06-29 10:27:47

Is this method taking care of the number of active threads ?Does the while True consumme CPU ?

binoua 2010-06-30 11:32:01

@binoua, this is the body of the thread. The main program should start the 10 threads. The `while True` doesn't burn CPU. Each cycle takes one url off the list. When there are no urls remaining to be downloaded each thread will exit. Can you post the code you are using so far?

gnibbler 2010-06-30 11:37:39

Answer 2

+3 A:

Using threads doesn't "download files from the Internet faster". You have only one network card and one internet connection so that's just not true.

The threads are being used to wait, and you can't wait faster.

You can use a single thread and be as fast, or even faster -- Just don't wait for the response of one file before starting another. In other words, use asynchronous, non-blocking network programming.

Here's a complete script that uses twisted.internet.task.coiterate to start multiple downloads at the same time, without using any kind of threading, and respecting the pool size (I'm using 2 simultaneous downloads for the demonstration, but you can change the size):

from twisted.internet import defer, task, reactor
from twisted.web import client
from twisted.python import log

@defer.inlineCallbacks
def deferMap(job, dataSource, size=1):
    successes = []
    failures = []

    def _cbGather(result, dataUnit, succeeded):
        """This will be called when any download finishes"""
        if succeeded:
            # you could save the file to disk here
            successes.append((dataUnit, result))
        else:
            failures.append((dataUnit, result))

    @apply
    def work():
        for dataUnit in dataSource:
            d = job(dataUnit).addCallbacks(_cbGather, _cbGather,
                callbackArgs=(dataUnit, True),  errbackArgs=(dataUnit, False))
            yield d

    yield defer.DeferredList([task.coiterate(work) for i in xrange(size)])
    defer.returnValue((successes, failures))

def printResults(result):
    successes, failures = result
    print "*** Got %d pages total:" % (len(successes),)
    for url, page in successes:
        print '  * %s -> %d bytes' % (url, len(page))
    if failures:
        print "*** %d pages failed download:" % (len(failures),)
        for url, failure in failures:
            print '  * %s -> %s' % (url, failure.getErrorMessage())

if __name__ == '__main__':
    import sys
    log.startLogging(sys.stdout)
    urls = ['http://twistedmatrix.com',
            'XXX',
            'http://debian.org',
            'http://python.org',
            'http://python.org/foo',
            'https://launchpad.net',
            'noway.com',
            'somedata',
        ]
    pool = deferMap(client.getPage, urls, size=2) # download 2 at once
    pool.addCallback(printResults)
    pool.addErrback(log.err).addCallback(lambda ign: reactor.stop())
    reactor.run()

Note that I included some bad urls on purpose so we can see some failures in the result:

...
2010-06-29 08:18:04-0300 [-] *** Got 4 pages total:
2010-06-29 08:18:04-0300 [-]   * http://twistedmatrix.com -> 16992 bytes
2010-06-29 08:18:04-0300 [-]   * http://python.org -> 17207 bytes
2010-06-29 08:18:04-0300 [-]   * http://debian.org -> 13820 bytes
2010-06-29 08:18:04-0300 [-]   * https://launchpad.net -> 18511 bytes
2010-06-29 08:18:04-0300 [-] *** 4 pages failed download:
2010-06-29 08:18:04-0300 [-]   * XXX -> Connection was refused by other side: 111: Connection refused.
2010-06-29 08:18:04-0300 [-]   * http://python.org/foo -> 404 Not Found
2010-06-29 08:18:04-0300 [-]   * noway.com -> Connection was refused by other side: 111: Connection refused.
2010-06-29 08:18:04-0300 [-]   * somedata -> Connection was refused by other side: 111: Connection refused.
...

nosklo 2010-06-29 11:19:16

So much code for such a little thing!Yes I know that in my case, threads are usefull only because I can start mutiple downloads at the same time.I'll try to do this but, again, I'm surprised that you're using many hints for such a simple question (@defer, @apply, the twisted module ...).I thought it was possible using basic Python with modules distributed with Python.

binoua 2010-06-30 11:29:57

@binoua: It's not that much code... The code you pasted in your question is **bigger**. And my code could be reduced further, I tried to keep it easy to understand. Another point: My code is also generic: the generic `deferMap` function can make a pool to run anything that returns `Deferred` s. `@apply` is common basic old python. Twisted really helps when doing things asynchronously without threads, but note that twisted is written in pure python and has no dependencies on non-python code. That means you **could** do what it does using only basic modules, but then you'd be only duplicating it

nosklo 2010-06-30 17:31:16

ansaurus

tags:

views:

answers:

Python limited multithreading

related questions