I have a reasonably long list of websites whose landing pages (index.html or equivalent) I want to download. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering whether wget or another alternative would be faster, given how straightforward the task is. Any ideas?
(Here's what I am doing with Scrapy. Is there anything I can do to optimize Scrapy for this task?)
So, I have a start URLs list like
start_urls = ["http://google.com/", "http://yahoo.com/", "http://aol.com/"]
and I scrape the text from each response and store it in an XML file. I have to turn off the OffsiteMiddleware to allow crawling multiple domains.
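
For reference, here is a minimal sketch of roughly what the spider looks like. The class name, the parse logic, and the output handling are simplified placeholders rather than my exact code, and the middleware path may differ depending on the Scrapy version:

    import scrapy

    class LandingPageSpider(scrapy.Spider):
        name = "landing_pages"
        start_urls = [
            "http://google.com/",
            "http://yahoo.com/",
            "http://aol.com/",
        ]

        # Disable OffsiteMiddleware so responses from any domain are kept.
        custom_settings = {
            "SPIDER_MIDDLEWARES": {
                "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
            },
        }

        def parse(self, response):
            # Pull the visible text of the landing page; the real spider
            # hands this off to the code that writes the XML output.
            yield {
                "url": response.url,
                "text": " ".join(response.xpath("//body//text()").extract()),
            }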
Scrapy works as expected, but seems slow (about 1000 pages an hour, or roughly one every 4 seconds). Is there a way to speed this up by increasing CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
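
These are the settings I understand affect throughput (the values are guesses I'd experiment with, not benchmarked numbers, and I believe newer Scrapy versions use CONCURRENT_REQUESTS in place of CONCURRENT_REQUESTS_PER_SPIDER):

    # settings.py -- knobs I'd try tuning for this one-page-per-site job
    CONCURRENT_REQUESTS = 100           # overall cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 1  # only one page per site is needed anyway
    DOWNLOAD_DELAY = 0                  # no politeness delay for a single page per domain
    DOWNLOAD_TIMEOUT = 15               # give up quickly on dead or very slow hosts
    RETRY_ENABLED = False               # a failed landing page isn't worth retrying
    ROBOTSTXT_OBEY = False              # skip the extra robots.txt fetch per domain, if acceptable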