views: 161
answers: 3

For the past month I've been using Scrapy for a web crawling project I've begun.

This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pages.
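
For reference, the spider is roughly this shape (a minimal sketch, not my exact code; the import paths follow recent Scrapy releases, and example.com stands in for the real domain):

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class FullSiteSpider(CrawlSpider):
        """Follow every in-domain link from the home page and keep the raw content."""
        name = "fullsite"
        allowed_domains = ["example.com"]      # placeholder for the real domain
        start_urls = ["http://example.com/"]   # crawl starts at the home page

        # Follow every link that stays inside allowed_domains.
        rules = (Rule(LinkExtractor(), callback="parse_page", follow=True),)

        def parse_page(self, response):
            # Store the full document plus the outgoing links for post-processing.
            yield {
                "url": response.url,
                "body": response.text,
                "links": response.css("a::attr(href)").getall(),
            }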

I'm beginning to think my initial suspicion was right: Scrapy simply isn't meant for this type of crawl.

I've begun to set my sights on Nutch and Methabot in the hope of better performance. The only data I need to store during the crawl is the full content of each web page and, preferably, all the links on the page (though even that could be done in post-processing).

I'm looking for a crawler that is fast and employs many parallel requests.

+1  A: 

This may be the fault of the server, not Scrapy. The server may not be as fast as you'd like, or it (or the webmaster) may detect the crawling and limit the speed for your connection/cookie. Do you use a proxy? That can slow crawling down too. It may also be deliberate caution on Scrapy's part: if you crawl too intensively you may get banned by the server. For my hand-written C++ crawler I artificially set a limit of 1 request per second, and that speed is enough for a single thread (1 req * 60 secs * 60 minutes * 24 hours = 86,400 req/day). If you're interested, you can email whalebot.helmsman {AT} gmail.com.

whalebot.helmsman
A: 

Scrapy lets you set the number of concurrent requests and the delay between requests in its settings.
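
For example, something like this in settings.py (the values are illustrative, and the setting names follow recent Scrapy releases; older versions used slightly different names):

    # settings.py -- illustrative values, not a recommendation
    CONCURRENT_REQUESTS = 64             # total requests Scrapy keeps in flight
    CONCURRENT_REQUESTS_PER_DOMAIN = 32  # how many of those may hit one domain
    DOWNLOAD_DELAY = 0                   # no artificial pause between requests

Since your crawl targets a single domain, CONCURRENT_REQUESTS_PER_DOMAIN is the setting that matters most.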

Tim McNamara
A: 

Do you know where the bottleneck is? As whalebot.helmsman pointed out, the limit may not be in Scrapy itself, but in the server you're crawling.

You should start by finding out whether the bottleneck is the network or CPU.
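
One rough way to check (a sketch assuming a machine with the third-party psutil package installed; this is not part of Scrapy) is to sample CPU load and download throughput while the crawl is running:

    import psutil  # third-party package, assumed installed

    def sample(interval=5, rounds=12):
        """Print CPU usage and download throughput while the crawl runs elsewhere."""
        last = psutil.net_io_counters()
        for _ in range(rounds):
            cpu = psutil.cpu_percent(interval=interval)  # blocks for `interval` seconds
            now = psutil.net_io_counters()
            kib_per_s = (now.bytes_recv - last.bytes_recv) / interval / 1024
            last = now
            print(f"CPU {cpu:5.1f}%   download {kib_per_s:8.1f} KiB/s")

    if __name__ == "__main__":
        sample()

If CPU sits near 100%, the bottleneck is parsing or your own code; if CPU is low and the download rate stays flat well below your bandwidth, the limit is the network or the remote server.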

Pablo Hoffman