I have been working on a spider that gathers data for research using Scrapy. It crawls around 100 sites, each of which contains a large number of links. I need to specify where the spider crawls so that I can tell it to collect data from certain parts of each site while skipping others to save time. I have had a lot of difficulty figuring out how to do this efficiently (all of the sites have the same format and structure, just different domains).

At the moment I only want the spider to follow links in specific parts of the websites, but I do not know how to control this well. So far I have been using SgmlLinkExtractor with Rules, along with its allow and restrict_xpaths arguments, to control where the spider crawls, but when I do it this way the spider does not continue to crawl and seems to stop as soon as the first callback is triggered on each site.
Am I going about this the wrong way, and is there a better way to specify where the spider crawls?
I am using the CrawlSpider type, along with several rules with individual callback methods, roughly as sketched below.
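Here is a minimal sketch of the kind of setup I have been using; the domain, XPath expressions, allow patterns, and callback names are placeholders, not the real research sites:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class ResearchSpider(CrawlSpider):
        name = "research"
        # In practice this lists ~100 domains that all share the same layout.
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/"]

        rules = (
            # Only extract links found inside a specific part of the page
            # (hypothetical navigation div) and matching a section pattern.
            Rule(SgmlLinkExtractor(allow=(r"/section-a/",),
                                   restrict_xpaths=("//div[@id='nav']",)),
                 callback="parse_section_a"),
            Rule(SgmlLinkExtractor(allow=(r"/section-b/",),
                                   restrict_xpaths=("//div[@id='nav']",)),
                 callback="parse_section_b"),
        )

        def parse_section_a(self, response):
            # Data collection for one part of the site would go here.
            self.log("Section A page: %s" % response.url)

        def parse_section_b(self, response):
            # Data collection for another part of the site would go here.
            self.log("Section B page: %s" % response.url)

With rules like these, each site stops being crawled shortly after the first callback fires, instead of continuing through the rest of the allowed sections.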