I have been working on a spider that gathers data for research using Scrapy. It crawls around 100 sites, each containing a large number of links. I need to specify where the spider crawls so that it collects data from certain parts of each site while skipping others to save time, and I have had a lot of difficulty figuring out how to do this efficiently (all of the sites share the same format and structure, only the domains differ). At the moment I only want the spider to follow links in specific parts of the websites, but I do not know how to control this well. So far I have been using SgmlLinkExtractor rules with the allow and restrict_xpaths arguments to control where the spider crawls, but when I do it this way it does not continue to crawl and seems to stop as soon as the first callback is triggered on each site.

Am I going about this the wrong way, and is there a better way to specify where the spider crawls?

I am using the CrawlSpider type, along with several rules with individual callback methods.
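A minimal sketch of the kind of setup I mean (the domain, URL pattern, and XPath below are placeholders, not my real values):

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ResearchSpider(CrawlSpider):
    name = 'research'
    # Placeholder domain; in reality there are ~100 domains with the same structure
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # Only extract links that match the allow pattern and that appear
        # inside the part of the page given by restrict_xpaths
        Rule(SgmlLinkExtractor(allow=(r'/articles/',),
                               restrict_xpaths=('//div[@id="content"]',)),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        # collect the data I need from each matched page
        pass
```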