Hello,

I am trying to make the SgmlLinkExtractor work.

This is the signature:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)

I am only using the allow argument.

So, I enter

rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

So, the initial URL is 'http://www.whitecase.com/jacevedo/' and I am passing allow=('/aadler',), expecting that '/aadler/' will get scanned as well. But instead, the spider scans the initial URL and then closes:

[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)

What am I doing wrong here?

Is there anyone here who has used Scrapy successfully and can help me finish this spider?

Thank you for the help.

I include the code for the spider below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()

Note: SO will not let me post more than one URL, so substitute the initial URL as necessary. Sorry about that.

A: 

allow=(r'/aadler/', ...

anibal
OK, but nothing changed. It still crawls only the initial URL.
Zeynel
A: 

I figured out that follow=True needs to be set:

rules = (Rule(SgmlLinkExtractor(allow=('/careers/n.\w+', )), callback='parse', follow=True))

But now I get:

File "C:\Python26\lib\site-packages\scrapy\contrib\spiders\crawl.py", line 132, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]

TypeError: 'Rule' object is not iterable
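
(Side note: the likely cause of this TypeError is the missing trailing comma after Rule(...). Without it, rules is bound to a single Rule object rather than a tuple, so CrawlSpider cannot iterate over it. A minimal sketch of the same rule written as a one-element tuple, keeping the extractor and callback from above:)

rules = (
    # Trailing comma keeps this a one-element tuple; the callback name
    # itself is a separate issue, addressed in the answer below.
    Rule(SgmlLinkExtractor(allow=(r'/careers/n.\w+', )), callback='parse', follow=True),
)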


Take a look at this sample spider:

rules = (
        # This rule allows scraping the start response to get the letter links
        Rule(SgmlLinkExtractor(allow=(r'/members/list$', )), 'parse_items', follow=True),
        # This rule allows going deeper into the per-letter pages and their many chunks
        Rule(SgmlLinkExtractor(allow=(r'/members\?letter=.', )), 'parse_items', follow=True),
)

I am doing the same thing. In the initial URL there is a link to /carrers/northamerica/, and allow=(r'/carrers/n.\w+') is supposed to pick up that link.

Any suggestions?

Zeynel
+1  A: 

It appears you are overriding the "parse" method. "parse" is a private method in CrawlSpider, used to follow links.

Andrew McCloud
Do you mean this line: callback='parse'?
Zeynel
Yes. Do not use the callback "parse" in your CrawlSpider Rule.
Andrew McCloud
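
(Putting the thread's fixes together, a minimal sketch of the spider: rules is written as a proper one-element tuple, and the callback is renamed so it does not shadow CrawlSpider's internal parse method. The name parse_item is an arbitrary choice for illustration; the start URL placeholder, item field, and XPath/regex are kept as in the question.)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Nu.items import NuItem

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    # Trailing comma keeps rules a tuple; the callback is not named 'parse',
    # which CrawlSpider reserves for its own link-following logic.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/aadler/', )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()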