views: 217
answers: 4

Hi, I want to batch-download web pages from one site. There are 5,000,000 URLs in my 'urls.txt' file, which is about 300 MB. How can I use multiple threads to fetch these URLs and download the pages? Or how else can I batch-download them?

My idea:

with open('urls.txt', 'r') as f:
    for url in f:
        ...  # fetch each url here

Or should I use Twisted?

Is there a good solution for this?

thanks

+1  A: 

Downloading 5M web pages in one go is definitely not a good idea, because you'll max out a lot of things, including your network bandwidth and your OS's file descriptors. I'd go in batches of 100-1000. You can use urllib.urlopen to get a socket and then just read() on several threads. You may be able to use select.select; if so, go ahead and download all 1000 at once and distribute each file handle that select returns across, say, 10 worker threads. If select won't work, limit your batches to 100 downloads and use one thread per download. Certainly you shouldn't start more than 100 threads, as your OS might blow up, or at least go a bit slow.
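For example, a minimal sketch of the batched, threaded approach (skipping the select.select variant), assuming Python 3's urllib.request in place of Python 2's urllib.urlopen; the batch size, thread count, and output file naming are just illustrative:

    import threading
    import urllib.request  # urllib.urlopen in Python 2

    BATCH_SIZE = 1000   # download in batches of 100-1000
    NUM_THREADS = 10    # stay well under the ~100-thread ceiling

    def worker(jobs):
        # jobs is a list of (index, url) pairs handled by this thread
        for index, url in jobs:
            try:
                data = urllib.request.urlopen(url, timeout=30).read()
                with open('page_%08d.html' % index, 'wb') as out:  # hypothetical naming scheme
                    out.write(data)
            except Exception as exc:
                print('failed %s: %s' % (url, exc))

    with open('urls.txt') as f:
        urls = [line.strip() for line in f if line.strip()]

    for start in range(0, len(urls), BATCH_SIZE):
        batch = list(enumerate(urls[start:start + BATCH_SIZE], start))
        threads = []
        for i in range(NUM_THREADS):
            # give each thread an interleaved slice of the current batch
            t = threading.Thread(target=worker, args=(batch[i::NUM_THREADS],))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()  # wait for the whole batch before starting the next one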

abc
+3  A: 

If this isn't part of a larger program, then notnoop's idea of using some existing tool to accomplish this is a pretty good one. If a shell loop invoking wget solves your problem, that'll be a lot easier than anything involving more custom software development.

However, if you need to fetch these resources as part of a larger program, then doing it with shell may not be ideal. In this case, I strongly recommend Twisted, which will make it easy to do many requests in parallel.

A few years ago I wrote up an example of how to do just this. Take a look at http://jcalderone.livejournal.com/24285.html.
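For what it's worth, here is a minimal sketch in the spirit of that post, assuming the older twisted.web.client.getPage API (removed from recent Twisted releases) and an arbitrary concurrency limit; see the linked article for the full example:

    from twisted.internet import defer, task
    from twisted.web.client import getPage

    CONCURRENCY = 20  # how many requests are allowed in flight at once

    def fetch_all(urls):
        # DeferredSemaphore caps how many getPage calls run in parallel
        sem = defer.DeferredSemaphore(CONCURRENCY)
        downloads = []
        for url in urls:
            d = sem.run(getPage, url.encode('ascii'))
            d.addErrback(lambda failure: None)  # ignore individual failures in this sketch
            downloads.append(d)
        return defer.gatherResults(downloads)

    def main(reactor):
        with open('urls.txt') as f:
            urls = [line.strip() for line in f if line.strip()]
        return fetch_all(urls)  # react() stops the reactor when this Deferred fires

    task.react(main)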

Jean-Paul Calderone
thanks :) It's great!
bell007
+1  A: 

First, parse your file and push the URLs onto a queue, then spawn 5-10 worker threads to pull URLs off the queue and download them. Queues are your friend with this.
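For example, a minimal sketch of that pattern, assuming Python 3's queue and urllib.request modules; the worker count and timeout are illustrative, and the pages are simply read and discarded here:

    import queue
    import threading
    import urllib.request

    NUM_WORKERS = 8
    q = queue.Queue(maxsize=1000)  # bounded, so parsing can't race far ahead of the workers

    def worker():
        while True:
            url = q.get()
            if url is None:      # sentinel: no more work for this thread
                break
            try:
                urllib.request.urlopen(url, timeout=30).read()
            except Exception as exc:
                print('failed %s: %s' % (url, exc))

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()

    with open('urls.txt') as f:   # parse the file and push the urls onto the queue
        for line in f:
            url = line.strip()
            if url:
                q.put(url)

    for _ in range(NUM_WORKERS):  # one sentinel per worker so they all exit
        q.put(None)
    for t in threads:
        t.join()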

fridder
thanks! "Queues are your friend with this." :)
bell007
A: 

A wget script is probably the simplest option, but if you're looking for a Python/Twisted crawling solution, check out Scrapy.

Jacob