ansaurus

Question

Answer 1

A:

SgmlLinkExtractor doesn't support selectors in its "allow" argument.

So this is wrong:

SgmlLinkExtractor(allow=["hxs.select('//td[@class='altRow'] ...')"])

This is right:

SgmlLinkExtractor(allow=[r"product\.php"])

Pablo Hoffman 2009-11-26 20:02:09
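
To illustrate the answer above: the entries in "allow" are regular expressions tested against each extracted URL, not XPath selectors. The following is a minimal sketch using only Python's re module, assuming the patterns are applied with a search against the absolute URL. The example.com URLs and the is_allowed helper are hypothetical and not part of Scrapy.

```python
import re

# Hypothetical URLs, made up for illustration only.
urls = [
    "http://www.example.com/product.php?id=1",
    "http://www.example.com/about.html",
]

# The same pattern as in the answer: a regex, with the dot escaped.
allow = [re.compile(r"product\.php")]

def is_allowed(url, patterns):
    """Return True if any pattern is found anywhere in the URL."""
    return any(p.search(url) for p in patterns)

# Keep only the URLs that match at least one allowed pattern.
matched = [u for u in urls if is_allowed(u, allow)]
print(matched)
```

A selector string such as "hxs.select(...)" would simply be compiled as a regex and would be unlikely to match any real URL, which is why it extracts no links.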

Ok, thanks. I simplified SgmlLinkExtractor to just one name: rules = (Rule(SgmlLinkExtractor(allow=["/aabbas"]), callback='parse'),) but I still get the same "index out of range" error. What do I need to do to make this work?

Zeynel 2009-11-26 20:54:10

If allow doesn't accept selectors, can I pass it items from a list?

Zeynel 2009-11-27 01:25:23

Like SgmlLinkExtractor(allow=["name"]), where name is "/aabbas"?

Zeynel 2009-11-27 01:26:26
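
On the question in the comment above, one point that may help: in Python, "name" in quotes is the literal four-letter string, while name without quotes is the variable's contents. The sketch below shows the difference with plain regex matching. It assumes, as in the earlier answer, that each pattern is searched for in the URL; the variable names are illustrative only.

```python
import re

name = "/aabbas"
url = "http://www.whitecase.com/aabbas"

# Quoted: a pattern that looks for the letters n-a-m-e.
literal_pattern = re.compile("name")

# Unquoted: a pattern built from the variable's contents.
# re.escape guards against regex metacharacters in the name.
variable_pattern = re.compile(re.escape(name))

literal_hit = literal_pattern.search(url) is not None
variable_hit = variable_pattern.search(url) is not None

print(literal_hit, variable_hit)
```

Under these assumptions, a list of names could be turned into patterns with a list comprehension such as [re.escape(n) for n in names].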

Answer 2

A:

The parse function is called for each match of your SgmlLinkExtractor.

As Pablo mentioned, you want to simplify your SgmlLinkExtractor.

Mark Ellul 2009-11-26 20:09:07

Ok, as I replied to Pablo, I tried it with just one of the names, /aabbas, but I still get the index out of range error. Can you help me rephrase the LinkExtractor so that it works? If it works, I can fine-tune it later. Thanks.

Zeynel 2009-11-26 20:57:39

Ok, so the match should be /aabbas. What does the parse function get for this match?

Zeynel 2009-11-26 21:27:33

Answer 3

A:

I also tried putting the names scraped from the initial URL into a list and then passing each name to parse as an absolute URL, for example http://www.whitecase.com/aabbas (for /aabbas).

The following code loops over the list, but I don't know how to pass the result to parse. Do you think this is a better idea?

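baseurl = 'http://www.whitecase.com'
# Every name starts with '/' so it can be appended directly to baseurl.
names = ['/aabbas', '/cabel', '/jacevedo', '/jacuna', '/igbadegesin']

def makeurl(baseurl, names):
    # Return one absolute URL per name. Returning inside the loop
    # would stop after the first name.
    return [baseurl + name for name in names]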

Zeynel 2009-11-26 21:15:23
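
As a sketch of the URL-building step described above, the standard library's urljoin can combine the base URL with each name whether or not the name has a leading slash. The make_urls helper is hypothetical, and the shortened names list is only for illustration.

```python
from urllib.parse import urljoin

baseurl = "http://www.whitecase.com"
# The first name deliberately lacks a leading slash to show that
# urljoin handles both forms.
names = ["aabbas", "/cabel", "/jacevedo"]

def make_urls(baseurl, names):
    """Return an absolute URL for every name in the list."""
    return [urljoin(baseurl, name) for name in names]

urls = make_urls(baseurl, names)
for url in urls:
    print(url)
```

In a Scrapy spider, a list like this is the kind of value usually assigned to start_urls, whose responses are handed to the spider's default parse callback. Whether that fits this particular spider is an assumption.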

Scrapy spider index error