I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:

  1. Run forever

    Meaning it will periodically re-visit some portal pages to get updates.

  2. Schedule priorities

    Give different priorities to different types of URLs.

  3. Multi-threaded fetching

I've read the Scrapy documentation but haven't found anything related to what I listed (maybe I wasn't careful enough). Does anyone here know how to do this, or can you share some ideas/examples? Thanks!

A: 

To make it run forever: invoke the crawl script from a cron job so the portal pages are re-visited on a schedule.
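A minimal sketch of such a cron-driven runner, assuming a Scrapy project with a spider called news_images (the module path and schedule below are made up):

    # run_crawl.py -- sketch of a runner script for cron to invoke.
    # Hypothetical crontab entry (re-visit the portals every 30 minutes):
    #   */30 * * * * cd /opt/crawler && /usr/bin/python run_crawl.py
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.news import NewsImageSpider  # assumed location

    process = CrawlerProcess(get_project_settings())
    process.crawl(NewsImageSpider)
    process.start()  # blocks until the crawl finishes; cron re-launches it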

For multi-threaded fetching, see http://dev.scrapy.org/wiki/CrawlerThread

You might assign priorities based on the URL type, as sketched below.
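For example, a sketch of priority-by-URL-type using the priority argument that Scrapy's Request accepts (higher values are scheduled earlier; the URL patterns and values here are assumptions):

    import scrapy

    # Assumed mapping from URL substring to scheduling priority.
    URL_PRIORITIES = {
        "/breaking/": 100,  # breaking-news pages first
        "/gallery/": 50,    # photo galleries next
    }

    def priority_for(url):
        """Return the priority for a URL; unmatched URLs keep the default 0."""
        for pattern, prio in URL_PRIORITIES.items():
            if pattern in url:
                return prio
        return 0

    class NewsImageSpider(scrapy.Spider):
        name = "news_images"  # hypothetical
        start_urls = ["http://example.com/"]

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                url = response.urljoin(href)
                yield scrapy.Request(url, callback=self.parse,
                                     priority=priority_for(url))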

anijhaw
+1  A: 

Scrapy is a framework for spidering websites; as such, it is intended to support your criteria, but it isn't going to dance for you out of the box; you will probably have to become relatively familiar with the module for some tasks.

  1. Running forever is up to your application that calls Scrapy. You tell the spiders where to go and when to go there.
  2. Giving priorities is the job of the scheduler middleware, which you'd have to create and plug into Scrapy. The documentation on this appears spotty and I've not looked at the code, but in principle the function is there.
  3. Scrapy is inherently, fundamentally asynchronous, which may well be what you are after: request B can be satisfied while request A is still outstanding. The underlying connection engine does not prevent you from bona fide multi-threading, but Scrapy doesn't provide threading services; concurrency is tuned through settings, as sketched after this list.
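To illustrate the third point, Scrapy expresses "how parallel is the fetching" as concurrency settings rather than threads; a settings sketch (the values are assumptions, not recommendations):

    # settings.py -- sketch: Scrapy multiplexes requests on one thread
    # via Twisted, so parallel fetching is tuned with settings like these.
    CONCURRENT_REQUESTS = 32            # requests in flight overall
    CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per news portal, to stay polite
    DOWNLOAD_DELAY = 0.25               # seconds between requests to a domain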

Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to do.

msw
Thanks! In my understanding, spiders seem to work as "one-time" jobs (crawl everything specified and quit). So do you mean that if I want a long-running crawler, I should write the application myself and call the spider to do the job? It isn't easy to implement the long-running logic inside Scrapy with middleware or something else, right?
superb
You could probably implement re-spider logic in the spider middleware layer, but the primitives don't seem well suited for it, and my gut feeling is that you'd be pushing application-layer logic down into the presentation level (if I may be allowed to misuse OSI terminology). http://doc.scrapy.org/topics/spider-middleware.html
msw
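For illustration, keeping the re-visit logic at the application layer, as msw suggests, could look roughly like the sketch below: the spider re-schedules each portal page itself with dont_filter=True so the duplicate filter never starves it. The spider name, selector, and delay are assumptions.

    import scrapy

    class PortalWatchSpider(scrapy.Spider):
        name = "portal_watch"  # hypothetical
        start_urls = ["http://example.com/news/"]
        custom_settings = {"DOWNLOAD_DELAY": 60}  # crude pacing for re-visits

        def parse(self, response):
            for src in response.css("img::attr(src)").getall():
                yield {"image_url": response.urljoin(src)}
            # Re-schedule the same portal page; dont_filter bypasses the
            # duplicate filter so the spider never runs out of requests.
            yield scrapy.Request(response.url, callback=self.parse,
                                 dont_filter=True)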