I have a reasonably long list of websites whose landing pages (index.html or equivalent) I want to download. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering whether wget or another alternative would be faster, given how straightforward the task is. Any ideas?
(Here's what I am doing with Scrapy. Is there anything I can do to optimize Scrapy for this task?)
So, I have a start URLs list like
start_urls = ["http://google.com/", "http://yahoo.com/", "http://aol.com/"]
and I scrape the text from each response and store it in an XML file. I have to turn off the OffsiteMiddleware to allow crawling multiple domains.
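
For reference, here is a minimal sketch of roughly what the spider looks like. The class name, the parse logic, and the output handling are simplified placeholders rather than my exact code, and the middleware path may differ depending on the Scrapy version:

    import scrapy

    class LandingPageSpider(scrapy.Spider):
        name = "landing_pages"
        start_urls = [
            "http://google.com/",
            "http://yahoo.com/",
            "http://aol.com/",
        ]

        # Disable OffsiteMiddleware so responses from any domain are kept.
        custom_settings = {
            "SPIDER_MIDDLEWARES": {
                "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
            },
        }

        def parse(self, response):
            # Pull the visible text of the landing page; the real spider
            # hands this off to the code that writes the XML output.
            yield {
                "url": response.url,
                "text": " ".join(response.xpath("//body//text()").extract()),
            }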
Scrapy works as expected, but seems slow (about 1000 pages an hour, or roughly one every 4 seconds). Is there a way to speed this up by increasing CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
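
These are the settings I understand affect throughput (the values are guesses I'd experiment with, not benchmarked numbers, and I believe newer Scrapy versions use CONCURRENT_REQUESTS in place of CONCURRENT_REQUESTS_PER_SPIDER):

    # settings.py -- knobs I'd try tuning for this one-page-per-site job
    CONCURRENT_REQUESTS = 100           # overall cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 1  # only one page per site is needed anyway
    DOWNLOAD_DELAY = 0                  # no politeness delay for a single page per domain
    DOWNLOAD_TIMEOUT = 15               # give up quickly on dead or very slow hosts
    RETRY_ENABLED = False               # a failed landing page isn't worth retrying
    ROBOTSTXT_OBEY = False              # skip the extra robots.txt fetch per domain, if acceptable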