Hello,

I am trying to make the SgmlLinkExtractor work.

This is the signature:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)

I am only using the allow argument.

So, I enter

rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

So, the initial URL is 'http://www.whitecase.com/jacevedo/' and I am passing allow=('/aadler',), expecting that '/aadler/' will get scanned as well. But instead, the spider scans the initial URL and then closes:

[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)

What am I doing wrong here?

Is there anyone here who has used Scrapy successfully and can help me finish this spider?

Thank you for the help.

I include the code for the spider below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from Nu.items import NuItem
from urls import u

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()

Note: SO will not let me post more than one URL, so substitute the initial URL as necessary. Sorry about that.

A: 

allow=(r'/aadler/', ...

anibal
OK, but nothing changed. It still crawls only the initial URL.
Zeynel
A: 

I figured out that follow=True needs to be set:

rules = (Rule(SgmlLinkExtractor(allow=('/careers/n.\w+', )), callback='parse', follow=True))

But now I get:

File "C:\Python26\lib\site-packages\scrapy\contrib\spiders\crawl.py", line 132, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]

TypeError: 'Rule' object is not iterable
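
(Side note: the likely cause of this TypeError is the missing trailing comma after Rule(...). Without it, rules is bound to a single Rule object rather than a tuple, so CrawlSpider cannot iterate over it. A minimal sketch of the same rule written as a one-element tuple, keeping the extractor and callback from above:)

rules = (
    # Trailing comma keeps this a one-element tuple; the callback name
    # itself is a separate issue, addressed in the answer below.
    Rule(SgmlLinkExtractor(allow=(r'/careers/n.\w+', )), callback='parse', follow=True),
)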


Take a look at this sample spider:

rules = (
        # This rule allows scraping the start response to get the letter links
        Rule(SgmlLinkExtractor(allow=(r'/members/list$', )), 'parse_items', follow=True),
        # This rule allows going deeper into the per-letter pages and their many chunks
        Rule(SgmlLinkExtractor(allow=(r'/members\?letter=.', )), 'parse_items', follow=True),
)

I am doing the same thing. In the initial URL there is a link to /carrers/northamerica/, and allow=(r'/carrers/n.\w+') is supposed to pick up that link.

Any suggestions?

Zeynel
+1  A: 

It appears you are overriding the "parse" method. "parse" is a private method in CrawlSpider, used to follow links.

Andrew McCloud
Do you mean this line: callback='parse'?
Zeynel
Yes. Do not use the callback "parse" in your CrawlSpider Rule.
Andrew McCloud
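
(Putting the thread's fixes together, a minimal sketch of the spider: rules is written as a proper one-element tuple, and the callback is renamed so it does not shadow CrawlSpider's internal parse method. The name parse_item is an arbitrary choice for illustration; the start URL placeholder, item field, and XPath/regex are kept as in the question.)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from Nu.items import NuItem

class NuSpider(CrawlSpider):
    domain_name = "wcase"
    start_urls = ['xxxxxx/jacevedo/']

    # Trailing comma keeps rules a tuple; the callback is not named 'parse',
    # which CrawlSpider reserves for its own link-following logic.
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/aadler/', )), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = NuItem()
        item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
        return item

SPIDER = NuSpider()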