For the past month I've been using Scrapy for a web crawling project I recently started.
The project involves pulling down the full document content of every page on a single domain that is reachable from the home page. Writing this in Scrapy was easy enough, but it simply runs too slowly: in 2-3 days I can only pull down 100,000 pages.
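For reference, Scrapy's parallelism is controlled by a handful of settings. Below is a minimal sketch of the relevant knobs in settings.py; these are standard Scrapy setting names, but the values are only illustrative, not a tuned configuration:

```python
# Standard Scrapy settings that govern request parallelism (settings.py).
# The values here are only illustrative, not a tuned configuration.
CONCURRENT_REQUESTS = 100             # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 100  # matters here since everything is one domain
DOWNLOAD_DELAY = 0                    # no artificial pause between requests
COOKIES_ENABLED = False               # skip cookie handling for a plain content crawl
RETRY_ENABLED = False                 # don't re-queue failed pages
DOWNLOAD_TIMEOUT = 15                 # give up on slow responses sooner
LOG_LEVEL = "INFO"                    # DEBUG logging itself slows a large crawl
REACTOR_THREADPOOL_MAXSIZE = 20       # more threads for DNS resolution
```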
At this point I'm starting to think my initial suspicion was right: Scrapy simply isn't meant for this type of crawl.
I've started looking at Nutch and Methabot in the hope of better performance. The only data I need to store during the crawl is the full content of each page and, ideally, all the links on the page (though even that could be done in post-processing).
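To make the requirements concrete, here is a simplified sketch of the kind of spider I mean (example.com stands in for the real domain, and my actual code differs in the details): it follows every in-domain link reachable from the home page and stores the raw body plus the outgoing links for each page.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["example.com"]      # placeholder for the real domain
    start_urls = ["https://example.com/"]

    # Follow every in-domain link and hand each fetched page to parse_page.
    rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

    def parse_page(self, response):
        # Store the full document content and the outgoing links.
        yield {
            "url": response.url,
            "body": response.text,
            "links": [link.url for link in LinkExtractor().extract_links(response)],
        }
```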
I'm looking for a crawler that is fast and employs many parallel requests.