I have a script that fetches several web pages and parses the info.

(An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 )

I ran cProfile on it, and as I assumed, urlopen takes up a lot of time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to python and web developing.

Thanks in advance! :)

UPDATE: I have a function called fetchURLs(), which I use to make an array of the URLs I need, so something like urls = fetchURLs(). The URLs are all XML files from Amazon and eBay APIs (which confuses me as to why it takes so long to load; maybe my web host is slow?)

What I need to do is load each URL, read each page, and send that data to another part of the script which will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched, that's what my issue is.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice :)


Here is the cProfile output:

Sun Aug 15 20:51:22 2010    prof

         211352 function calls (209292 primitive calls) in 22.254 CPU seconds

   Ordered by: internal time
   List reduced from 404 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       10   18.056    1.806   18.056    1.806 {_socket.getaddrinfo}
     4991    2.730    0.001    2.730    0.001 {method 'recv' of '_socket.socket' objects}
       10    0.490    0.049    0.490    0.049 {method 'connect' of '_socket.socket' objects}
     2415    0.079    0.000    0.079    0.000 {method 'translate' of 'unicode' objects}
       12    0.061    0.005    0.745    0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
     3428    0.060    0.000    0.202    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
     1698    0.055    0.000    0.068    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
     4125    0.053    0.000    0.056    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
     1698    0.042    0.000    0.358    0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
     1698    0.042    0.000    0.275    0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)
A: 

The actual wait is probably not in urllib2 but in the server and/or your network connection to the server.

There are 2 ways of speeding this up.

  1. Keep the connection alive (see this question on how to do that: http://stackoverflow.com/questions/1037406/python-urllib2-with-keep-alive)
  2. Use multiple connections; you can use threads or an async approach, as Aaron Gallagher suggested. For that, simply use any threading example and you should do fine :) You can also use the multiprocessing lib to make things pretty easy (a minimal sketch follows below).
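
For illustration, a minimal sketch of the multiprocessing route from point 2, using the thread-backed Pool from multiprocessing.dummy so the process count stays low on a restricted host (the pool size and URLs here are just placeholders):

import urllib2
from multiprocessing.dummy import Pool  # thread-backed Pool with the same API as multiprocessing.Pool

def fetch(url):
    return urllib2.urlopen(url).read()

urls = ['http://docs.python.org/howto/urllib2.html',
        'http://docs.python.org/library/multiprocessing.html']

pool = Pool(4)                 # 4 workers, well under a 25-process host limit
pages = pool.map(fetch, urls)  # blocks until every URL is fetched; results stay in list order
pool.close()
pool.join()
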
WoLpH
Thanks WoLpH! Much appreciated :) Will keeping the connection alive work even if I'm grabbing different web pages?
Parker
@Parker: the up arrow to the left of the answer says "This answer is useful", and at 15+ reputation you can click it in addition to the accept checkmark.
msw
-1 for suggesting threads. This is IO-bound; threads are useless here.
Aaron Gallagher
@Aaron, usually threads work brilliantly for downloading web pages. The process won't be I/O bound unless it's downloading really large files or the latency is very low. urllib2 will typically spend most of its time blocked, waiting for a response, which are perfect conditions for Python's GIL/threading
gnibbler
@gnibbler, no, that's what IO bound *means*: the process spends most of its time waiting on IO. Multiple threads don't make you wait for data any faster. Just use nonblocking IO; there's no extra code complexity or locking overhead.
Aaron Gallagher
@Aaron, sure you are correct about the definition of IO Bound, but wrong about the effectiveness of threading to download a bunch of urls.
gnibbler
@gnibbler, I never once said it wasn't *effective*. I've only been claiming that it's not worth the numerous pitfalls and caveats associated with it, which most of the answers conveniently gloss over or ignore.
Aaron Gallagher
@Aaron Gallagher: threads are far from useless here. Yes it is IO bound, but not _your_ IO. It's the latency, bandwidth and server response time that will limit you. Threading is by far the most effective way of making a download system faster.
WoLpH
@Parker: keeping the connection alive will only work as long as you stay on the same server. If the new page is on a different webserver then it will fail. So a good guess is... if it is still on the same domain and subdomain, then you can keep the connection alive. Otherwise you probably don't want to try.
WoLpH
@WoLpH, did you miss the part where I posted benchmarks of twisted beating out two different implementations of threaded downloaders? Twisted runs in a single thread, using an event loop. There is no locking overhead. Asynchronous IO, through an event loop, is the most effective way of making a download faster.
Aaron Gallagher
@Aaron Gallagher: yes, I hadn't seen your answer yet. Letting it work async is a nice way of fixing the problem indeed. On the lower level it doesn't make much difference however. Using async is a nice way of emulating multiple threads indeed. And for this purpose a better solution. However... I would have opted for a plain Python solution instead of using the huge Twisted framework.
WoLpH
@WoLpH, uh, event loops are not in any way trying to "emulate" multiple threads. If anything, threads try to "emulate" being an event loop by trying to turn a blocking API into something that can be used asynchronously.
Aaron Gallagher
@Aaron Gallagher: Alright, rephrased: to emulate the behaviour of multiple threads running simultaneously. If you want to go down a couple of levels then you will simply end up with CPython executing everything sequentially on a single processor and switching as soon as one of the threads blocks. The two are quite comparable really: with threads your method gets cpu time once it stops blocking; with async your method gets called once it stops blocking. It is just a different interface for the same technology.
WoLpH
@WoLpH: Twisted is a plain python solution -- it is written in python.
nosklo
@nosklo: Do you seriously not know what I meant, or are you just trying to start a pointless debate? Either way, when I was talking about `a plain Python solution` I meant using the Python base library versus the use of a huge library.
WoLpH
@WoLpH: Twisted is written in python and open source. You could just copy what twisted does into your code -- then you'll have a plain python solution, using only the base library. The part that does this isn't that big.
nosklo
@nosklo: I'll have to take your word for it not being that big. My experience with Twisted has been quite the opposite.
WoLpH
@WoLpH: That's irrelevant anyway. Point is that the best, fastest, correct™ way of doing multiple downloads in parallel, using only python, is to do what twisted does: use asynchronous code. You could write it yourself, but Twisted is already written and well tested, so why not use it? My hard drive is 500GB so it fits twisted many times, the size does not matter
nosklo
@nosklo: The size does matter. No, not in terms of storage of course, but in terms of readability. A huge codebase will take more time to learn/understand than a small codebase. So an `asyncore` example would be more fitting here.
WoLpH
A: 

Fetching webpages obviously will take a while as you're not accessing anything local. If you have several to access, you could use the threading module to run a couple at once.

Here's a very crude example

import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']
data1 = []
data2 = []

class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)
    def run(self):
        page = urllib2.urlopen(self.url)
        self.datadump.append(page.read()) # don't do it like this.

print "Starting threaded reads:"
start = time.clock()
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls): pass # don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i,x in enumerate(data1):
    print len(data1[i]), len(data2[i])

This was the output when I ran it:

Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483

Grabbing the data from the threads by appending to a list is probably ill-advised (a Queue would be better; see the sketch below), but it illustrates that there is a difference.
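
If you want to go the Queue route, a rough sketch (assuming the same Python 2 setup as above; the function name is just for illustration) might look like this, joining the threads instead of busy-waiting on the list length:

import threading
import urllib2
import Queue

def fetch_all(urls):
    results = Queue.Queue()

    def fetch(url):
        page = urllib2.urlopen(url)
        results.put((url, page.read()))  # Queue.put is thread-safe

    threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every fetch to finish instead of polling
    # (url, data) pairs come back in completion order, not list order
    return [results.get() for _ in urls]
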

Nick T
Thanks Nick! Much appreciated :)
Parker
And may I ask why self.datadump.append(page.read()) # don't do it like this. is ill advised?
Parker
-1 for suggesting threads. This is IO-bound; threads are useless here.
Aaron Gallagher
@Aaron Gallagher Why did it run over twice as fast using threads?
Nick T
@Nick, I never denied that your code can execute in less time. The problem is that the means by which you achieve that produce unsustainable, overcomplicated code compared to using async IO.
Aaron Gallagher
+8  A: 

Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.

from twisted.internet import defer, reactor
from twisted.web.client import getPage
import time

def processPage(page, url):
    # do something here.
    return url, len(page)

def printResults(result):
    for success, value in result:
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

def printDelta(_, start):
    delta = time.time() - start
    print 'ran in %0.3fs' % (delta,)
    return delta

urls = [
    'http://www.google.com/',
    'http://www.lycos.com/',
    'http://www.bing.com/',
    'http://www.altavista.com/',
    'http://achewood.com/',
]

def fetchURLs():
    callbacks = []
    for url in urls:
        d = getPage(url)
        d.addCallback(processPage, url)
        callbacks.append(d)

    callbacks = defer.DeferredList(callbacks)
    callbacks.addCallback(printResults)
    return callbacks

@defer.inlineCallbacks
def main():
    times = []
    for x in xrange(5):
        d = fetchURLs()
        d.addCallback(printDelta, time.time())
        times.append((yield d))
    print 'avg time: %0.3fs' % (sum(times) / len(times),)

reactor.callWhenRunning(main)
reactor.run()

This code also performs better than any of the other solutions posted (edited after I closed some things that were using a lot of bandwidth):

Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 29996)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.518s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.461s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30033)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.435s
Success: ('http://www.google.com/', 8117)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.449s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.547s
avg time: 0.482s

And using Nick T's code, rigged up to also give the average of five and show the output better:

Starting threaded reads:
...took 1.921520 seconds ([8117, 30070, 15043, 8386, 28611])
Starting threaded reads:
...took 1.779461 seconds ([8135, 15043, 8386, 30349, 28611])
Starting threaded reads:
...took 1.756968 seconds ([8135, 8386, 15043, 30349, 28611])
Starting threaded reads:
...took 1.762956 seconds ([8386, 8135, 15043, 29996, 28611])
Starting threaded reads:
...took 1.654377 seconds ([8117, 30349, 15043, 8386, 28611])
avg time: 1.775s

Starting sequential reads:
...took 1.389803 seconds ([8135, 30147, 28611, 8386, 15043])
Starting sequential reads:
...took 1.457451 seconds ([8135, 30051, 28611, 8386, 15043])
Starting sequential reads:
...took 1.432214 seconds ([8135, 29996, 28611, 8386, 15043])
Starting sequential reads:
...took 1.447866 seconds ([8117, 30028, 28611, 8386, 15043])
Starting sequential reads:
...took 1.468946 seconds ([8153, 30051, 28611, 8386, 15043])
avg time: 1.439s

And using Wai Yip Tung's code:

Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30051 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.704s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.845s
Fetched 8153 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30070 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.689s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.647s
Fetched 8135 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30349 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.693s
avg time: 0.715s

I've gotta say, I do like that the sequential fetches performed better for me.

Aaron Gallagher
I do like that I've gotten -2 with no comments! Come on, downvoters, try to show that my code is bad~
Aaron Gallagher
No downvote from me since it's a proper solution. But do you have a plain `Python` version instead of using the huge `Twisted` framework?
WoLpH
Your benchmarks are mildly flawed imho. You are benchmarking the great search engines, which will always respond nearly instantly. When using your solution with normal websites the sequential fetches will perform worse, because then the bottleneck will be on the server side/internet instead of your Python code.
WoLpH
@WoLpH, I modified the other code I tested to request the same sites. See how the lengths are all basically the same?
Aaron Gallagher
@WoLpH, also, "huge"? Twisted is quite a bit smaller than python.
Aaron Gallagher
I am not disputing that. I am saying that most websites will not respond as fast as the major search engines. When testing with any regular website with lots of content your results will be completely different. Sequentially fetching results will only be faster in cases like these where your Python code is actually the bottleneck.
WoLpH
According to `SLOCCount` the twisted source has `144,898` physical source lines of code. Such a codebase is huge in my book. If you want the person that asked the question to actually understand the code he's using, it will be hard to read through all the code used in Twisted.
WoLpH
@WoLpH, that's probably including all of the unit tests, and all of the optional packages that wouldn't be necessary for something as simple as fetching web pages. And, again, I guarantee that the number of lines used in python itself just to invoke `urllib2` is going to be greater. And regarding the benchmarks, I could pick another bunch of sites, but the original code used only `docs.python.org`, which is dog slow and a bit unreliable on my connection.
Aaron Gallagher
@Aaron Gallagher: Yes, there is a bunch of other code in `Twisted` that is not used here. But that doesn't negate the fact that the amount of code you'll have to read through with `Twisted` will be substantial. As for reading `urllib2`, that's not the point here. The working of `urllib2` is not the question; it's the working of either a `threading` approach or the `async` approach. Well... that's the point. On slow websites the results are completely different. And most websites are closer to `docs.python.org` than `google.com` in terms of performance.
WoLpH
I would use Twisted, but I want to print the data of the fetched pages in a certain order. Can I do this with twisted? It seems I might reach the part of the script that prints the info before it actually arrives. Can I make my script pause until the data is received?
Parker
@Parker, that's exactly what the `DeferredList` does. Here's a link to another answer I wrote that describes how it works a bit better: http://stackoverflow.com/questions/3488854/making-a-python-program-wait-until-twisted-deferred-returns-a-value/3489088#3489088
Aaron Gallagher
Hmm, I tried your example code, but the script just hung and wouldn't do anything.
Parker
@Parker: The reactor runs endlessly; there are probably some ways to have it automatically die when you're done, but if you're doing a quick script it's overkill. For an application like http://bluedevilbooks.com/search/?DEPT=MATH, while Twisted is a fine tool, it is not needed to replace every instance of urllib.
Nick T
Speaking of which, you could toss some example pages you're getting into the question to provide more applicable data. (versus instantly getting homepages from massive websites, which would be where threads don't perform as well)
Nick T
Nick, thanks! I updated the question for you. I've been trying the threads but the script just locks up and doesn't do anything.
Parker
@Parker, If you have a large list of urls this approach may not work well for you as it opens one connection per url more or less simultaneously. This may be causing your internet connection to choke up. Try running a smaller number of urls at a time to see if that helps
gnibbler
+1  A: 

EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading vs. async I/O. Therefore I am also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

This is a dup of a question from 3 days ago.

Python urllib2.open is slow, need a better way to read several urls - Stack Overflow http://stackoverflow.com/questions/3472515/python-urllib2-open-is-slow-need-a-better-way-to-read-several-urls/3472905#3472905

I'm polishing the code to show how to fetch multiple webpages in parallel using threads.

import time
import threading
import Queue
import urllib2

# utility - spawn a thread to execute target for each args
def run_parallel_in_threads(target, args_list):
    result = Queue.Queue()
    # wrapper to collect return value in a Queue
    def task_wrapper(*args):
        result.put(target(*args))
    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def dummy_task(n):
    for i in xrange(n):
        time.sleep(0.1)
    return n

# below is the application code
urls = [
    ('http://www.google.com/',),
    ('http://www.lycos.com/',),
    ('http://www.bing.com/',),
    ('http://www.altavista.com/',),
    ('http://achewood.com/',),
]

def fetch(url):
    return urllib2.urlopen(url).read()

run_parallel_in_threads(fetch, urls)

As you can see, the application-specific code is only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.

Unfortunately most of the other threading code posted here has some flaws. Many of the examples actively poll to wait for the threads to finish; join() is a better way to synchronize the code. I think this code has improved upon all the threading examples so far.

keep-alive connection

WoLpH's suggestion about using a keep-alive connection could be very useful if all your URLs are pointing to the same server.
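
As a rough illustration (the host and paths here are placeholders), the standard httplib module can reuse one connection for several requests to the same server, since HTTPConnection speaks HTTP/1.1 and keeps the socket open as long as each response is read fully:

import httplib

conn = httplib.HTTPConnection('docs.python.org')
pages = []
for path in ('/library/threading.html', '/howto/urllib2.html'):
    conn.request('GET', path)
    resp = conn.getresponse()
    pages.append(resp.read())  # read the whole body before sending the next request
conn.close()
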

twisted

Aaron Gallagher is a fan of the twisted framework and is hostile to anyone who suggests threads. Unfortunately a lot of his claims are misinformation. For example he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to evidence, as both Nick T and I have demonstrated a speed gain from using threads. In fact I/O-bound applications have the most to gain from using Python's threads (vs. no gain in CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.

Right tool for the right job

I'm well aware of the issues pertaining to parallel programming using threads, python, async I/O and so on. Each tool has its pros and cons. For each situation there is an appropriate tool. I'm not against twisted (though I have not deployed it myself). But I don't believe we can flatly say that threads are BAD and twisted is GOOD in all situations.

For example, if the OP's requirement is to fetch 10,000 websites in parallel, async I/O will be preferable. Threading won't be appropriate (unless maybe with stackless Python).

Aaron's opposition to threads is mostly generalizations. He fails to recognize that this is a trivial parallelization task. Each task is independent and does not share resources. So most of his attacks do not apply.

Given that my code has no external dependencies, I'll call it the right tool for the right job.

Performance

I think most people would agree that the performance of this task largely depends on the networking code and the external server, and that the performance of the platform code should have negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

In Nick's code, there is an obvious flaw that caused the inefficiency. But how do you explain the 233ms speed gain over my code? I think even twisted fans will refrain from jumping to the conclusion that this is due to the efficiency of twisted. There are, after all, a huge number of variables outside of the system code, like the remote server's performance, the network, caching, differences in implementation between urllib2 and the twisted web client, and so on.

Just to make sure Python's threading does not incur a huge amount of inefficiency, I did a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.

In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[275]: <Queue.Queue instance at 0x038B2878>

In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16 s

In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
Wall time: 1.13 s       <<<<<<<< This means 0.13s of overhead

Further testing on my parallel fetching shows a huge variability in response time over 17 runs. (Unfortunately I don't have twisted to verify Aaron's code.)

0.75 s
0.38 s
0.59 s
0.38 s
0.62 s
1.50 s
0.49 s
0.36 s
0.95 s
0.43 s
0.61 s
0.81 s
0.46 s
1.21 s
2.87 s
1.04 s
1.72 s

My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test to measure the systematic performance difference between async I/O and threading.

Wai Yip Tung
See the other comment I just left: I never said that threads can't be effective in this situation. It's just not worth the problems with threads that everyone seems to forget or ignore in their answers. Here is an enlightening graphic: http://www.erights.org/elib/concurrency/images/badtradeoff.gif
Aaron Gallagher
This isn't an answer, it's three comments. Please don't abuse the Q/A system, comment as necessary.
Devin Jeanpierre
@Aaron, you said threads are useless twice. Something cannot be useless if it is effective.
Wai Yip Tung
@Devin, I'm going to expand my response to contain both an answer and a rebuttal of some arguments. There is a significant amount of misinformation in this discussion. Unfortunately I need more space than a small comment for a rebuttal. This is an adaptive use of the Q/A system. Most importantly, I want people to choose the right tool for the right job, and not to reject a tool due to misinformation.
Wai Yip Tung
Wai Yip Tung, amazing write-up, you explained quite a bit, I appreciate that! I tried running your code, I just can't figure out what the data (the urlopen().read()) is called.
Parker
Scratch that, I got it now! I'm going to put it into my script and I'll let you know how it goes.
Parker
Hmm, I seem to be getting <class 'Queue.Empty'>: args = () message = '' when I put your code in my script. Do I have to have the () and the , around each URL? If I do that, it kind of works, but one of the URLs becomes invalid.
Parker
@Parker, I've polished the code even more. I posted it as a recipe on ASPN http://code.activestate.com/recipes/577360-a-multithreaded-concurrent-version-of-map/. It should be really easy to use.
Wai Yip Tung
@Parker, I just read about your problem. Try the ASPN version. I have adopted the map interface and dropped the more clunky Queue and the (,) issue that tripped you up.
Wai Yip Tung
"Each task is independent and do not share resources." So, why use threads at all? Threads "share resources" *by definition*. Maybe this whole time you've been trying to suggest using a process pool for fetching pages, but were unaware of the differences between processes and threads.
Aaron Gallagher
@Wai, that's a false dichotomy. Threads are useless here because they add extra complexity that an event loop wouldn't add. Did you look at the graphic I posted in my first comment?
Aaron Gallagher
Jeez, this answer is just chock full of wrong. I keep finding more things. One of the biggest problems with threads is that they can *appear* to be simple, when there's a lot of things going on that aren't obvious at all. You claim that because your code is three lines long, it can't be complex or unmaintainable. I can't vouch for whether urllib2 is thread-safe, but there's a number of things that *aren't* thread safe, and will break in subtle ways when run with `run_parallel_in_threads`. The complexity is still there, but deferred to other places.
Aaron Gallagher
And I'm really confused as to how you can say that threading is not consistently slower than async IO when *you didn't even test async IO*. I'm the only one who's posted benchmarks using twisted, and my benchmarks *do* show a consistent difference.
Aaron Gallagher
I don't need to test async I/O. The reason I say that is that my own test results vary by as much as 2.51s between runs, so it is not valid for someone to claim an alternative solution is consistently faster by a much smaller margin. Only if the alternative code were slower than this code by a margin a lot greater than 2.51s could we claim it is consistently slower.
Wai Yip Tung
@Aaron, you have not found any problem. You are just making FUD claims. If you find anything in urllib2, or any other part of Python, that's not thread-safe, please file a bug. There is tons of production software using threads. If Python were not designed to be thread-safe they would all be idiots to use it in production.
Wai Yip Tung
Threads "share resources" by definition? In what sense? Can you point out what resources it is sharing in this code? And please don't make stupid suggestion that I confuse process with thread. I cannot possibly be such as idiot.
Wai Yip Tung
Wai Yip Tung, I cannot express enough how much of a life saver you have been. You constantly came back to address any issues I had and you went out of your way to explain everything, why I should use what, and how to use it. I really appreciate what you've done for me! Thanks so much!
Parker
Off-topic; how does this answer have 5 downvotes? Twisted fans are quite vindictive. :P
Nick T
I don't know, but it kind of scares me off from using the module :P I don't see what the "issue with threads" is. For my purposes, I'm just loading a few URLs; I don't need to import some massive library when I can just fetch them in parallel. Also, the threading lets me wait for all of the pages to be fetched before moving on, whereas Twisted will try to keep going and cause issues.
Parker
@parker, you're welcome. I'm glad that it has helped you.
Wai Yip Tung
@Nick, it is not just me. Almost everyone gets a down vote, presumably on the grounds that threads are bad. Usually people on stackoverflow are quite civilized even when they disagree. This is the only group of people who go to great lengths to vote people down.
Wai Yip Tung
+2  A: 

Here is an example using Python threads. The other threaded examples here launch a thread per url, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many urls on the same host)

from threading import Thread
from urllib2 import urlopen
from time import time, sleep

WORKERS=1
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = []

class Worker(Thread):
    def run(self):
        while urls:
            url = urls.pop()
            results.append((url, urlopen(url).read()))

start = time()
threads = [Worker() for i in range(WORKERS)]
any(t.start() for t in threads)

while len(results)<40:
    sleep(0.1)
print time()-start

Note: The times given here are for 40 urls and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300ms

With WORKERS=1 it took 86 seconds to run
With WORKERS=4 it took 23 seconds to run
with WORKERS=10 it took 10 seconds to run

so having 10 threads downloading is 8.6 times as fast as a single thread.

Here is an upgraded version that uses a Queue. There are at least a couple of advantages.
1. The urls are requested in the order that they appear in the list
2. Can use q.join() to detect when the requests have all completed
3. The results are kept in the same order as the url list

from threading import Thread
from urllib2 import urlopen
from time import time, sleep
from Queue import Queue

WORKERS=10
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = [None]*len(urls)

def worker():
    while True:
        i, url = q.get()
        # print "requesting ", i, url       # if you want to see what's going on
        results[i]=urlopen(url).read()
        q.task_done()

start = time()
q = Queue()
for i in range(WORKERS):
    t=Thread(target=worker)
    t.daemon = True
    t.start()

for i,url in enumerate(urls):
    q.put((i,url))
q.join()
print time()-start
gnibbler
Not according to my benchmarks. I think you're doing something wrong.
Aaron Gallagher
@Aaron, The program is right there. It's pretty simple. Why do you think I am doing something wrong?
gnibbler
Well, I'm just going to stop right after "you're not using any thread-safe data structures to communicate through threads" because that's a painfully amateurish mistake.
Aaron Gallagher
@Aaron, Do you mean `list.pop()` and `list.append()`? They are guaranteed to be thread safe in Python.
gnibbler
Guaranteed by whom? Do you have a link to a document that espouses this?(And if it were true, why does the `Queue` module exist?)
Aaron Gallagher
http://effbot.org/zone/thread-synchronization.htm#atomic-operations Using a Queue here instead would not be difficult. A list is adequate for this simple example as I don't mind fetching the urls in reverse order. Obviously popping from the beginning of a list over and over is not very efficient
gnibbler
@gnibbler, this document is completely incorrect. I can't say that it's just out of date because, well, I don't know when this was ever true. None of these operations are atomic; I can point out times when each one of them could yield execution to another thread. (A simple example: reading an instance attribute has at least a dozen different ways to invoke arbitrary python. `__getattr__` is one everyone knows of.) This would be too long for just one comment, so if you were to make a new question for this, I'd be glad to list the problems.
Aaron Gallagher
I would like you to show how list.append is not atomic. I've looked at the byte code - dis.dis(compile("[].append(1)","","exec")). The append happens in instruction #9. It looks atomic to me.
Wai Yip Tung
@Aaron, Queue does more than transferring data atomically. It is a bounded buffer, meaning it can block producer or consumer until data or space is available for synchronization purpose.
Wai Yip Tung
gnibbler, thanks! :) However, when I run the script (the second one), which I copied word for word and just replaced the URLs with urls = getURLS(), it just keeps running. It won't display anything or stop.
Parker
@Parker, have you tried adding the print statement where I indicated? How many urls does getURLS return? Perhaps it is just taking a long time.
gnibbler
@Wai, just because it's implemented in one opcode in *one part of* the bytecode doesn't mean that it's an atomic operation. Calling a python function, for example, only takes one opcode. Would you say that calling an arbitrary python function is atomic?
Aaron Gallagher
@Wai, and, uh, I never disagreed on what `Queue` is for? I don't understand what you're trying to say with that comment.
Aaron Gallagher
@Aaron, about Queue, let me remind you of the context. You were challenging gnibbler's claim that list.append is atomic. And you said Queue would not exist if list.append were atomic. I was reminding you that Queue's primary purpose is to implement a bounded buffer.
Wai Yip Tung
Ah, got it gnibbler. Thanks!
Parker