views: 722
answers: 4

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- this will instead be configurable in a GUI.

How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamically configurable? E.g. I write the configuration to a file, and the spider reads it somehow.

A: 

What are you scraping the web for, if you don't mind explaining?

Taos
I'm not. I'm writing a simple, configurable spider that can be used to spider a small set of specific websites. When new websites on the same topic appear, it should be simple to add them to the set.
Christian Davén
Well, the thing is you might not need to spider it; if you know what you're looking for, you can just grab the HTML or read the XML. I did the same when I wrote my TV scheduler.
Taos
+1  A: 

Shameless self-promotion of Domo (http://github.com/hinoglu/Domo/)! You'll need to instantiate the crawler for your project as shown in the examples.

You'll also need to make the crawler configurable at runtime, which simply means passing the configuration to the crawler and overriding its settings at runtime whenever the configuration changes.
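
As a rough illustration of that pattern (this is not Domo's actual API -- the ConfigurableCrawler class, the update_settings method and the spider_config.json file below are all hypothetical names of mine), runtime reconfiguration boils down to re-reading the configuration and handing the new values to the crawler:

import json

class ConfigurableCrawler(object):
    # hypothetical stand-in for whatever crawler class you end up using
    def __init__(self, domains, url_regexes):
        self.domains = domains
        self.url_regexes = url_regexes

    def update_settings(self, config):
        # override the settings at runtime when the configuration changes
        self.domains = config['domains']
        self.url_regexes = config['url_regexes']

def load_config(path):
    # read the user-edited configuration (domains and URL regexes)
    with open(path) as f:
        return json.load(f)

config = load_config('spider_config.json')
crawler = ConfigurableCrawler(config['domains'], config['url_regexes'])

# later, when the GUI rewrites the file, re-apply the settings
crawler.update_settings(load_config('spider_config.json'))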

hinoglu
+1  A: 

What you need is to dynamically create spider classes, subclassing your favorite generic spider class as supplied by Scrapy (CrawlSpider subclasses with your rules added, XmlFeedSpider, or whatever) and adding domain_name, start_urls, and possibly extra_domain_names (and/or start_requests(), etc.), as you get or deduce them from your GUI (or config file, or whatever).

Python makes it easy to perform such dynamic creation of class objects; a very simple example might be:

from scrapy import spider

def makespider(domain_name, start_urls,
               basecls=spider.BaseSpider):
  return type(domain_name + 'Spider',
              (basecls,),
              {'domain_name': domain_name,
               'start_urls': start_urls})

allspiders = []
for domain, urls in listofdomainurlpairs:
  allspiders.append(makespider(domain, urls))

This gives you a list of very bare-bones spider classes -- you'll probably want to add parse methods to them before you instantiate them. Season to taste... ;-).
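
One way to do that -- sketched here with an illustrative, logging-only callback of my own, not something from the answer above -- is to pass the parse function into the class dict as well:

from scrapy import spider

def makespider(domain_name, start_urls, parse_func,
               basecls=spider.BaseSpider):
    # same idea as above, but the caller supplies the parse callback too
    return type(domain_name + 'Spider',
                (basecls,),
                {'domain_name': domain_name,
                 'start_urls': start_urls,
                 'parse': parse_func})

def parse(self, response):
    # placeholder callback: just report what was fetched
    self.log('got %s (%d bytes)' % (response.url, len(response.body)))

spider_cls = makespider('example.com', ['http://example.com/'], parse)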

Alex Martelli
+4  A: 

Override the default SpiderManager class, load your custom rules from a database or somewhere else, and instantiate a custom spider with your own rules/regexes and domain_name.

in mybot/settings.py:

SPIDER_MANAGER_CLASS = 'mybot.spidermanager.MySpiderManager'

in mybot/spidermanager.py:

from mybot.spider import MyParametrizedSpider

class MySpiderManager(object):
    loaded = True

    def fromdomain(self, name):
        start_urls, extra_domain_names, regexes = self._get_spider_info(name)
        return MyParametrizedSpider(name, start_urls, extra_domain_names, regexes)

    def close_spider(self, spider):
        # Put here the code you want to run before the spider is closed
        pass

    def _get_spider_info(self, name):
        # query your backend (maybe a sqldb) using `name` as primary key, 
        # and return start_urls, extra_domains and regexes
        ...
        return (start_urls, extra_domains, regexes)
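
To make _get_spider_info concrete, here is a minimal sketch that assumes a sqlite backend with a hypothetical spiders table (columns name, start_urls, extra_domains and regexes, the last three stored as JSON lists); the database file, table and column names are assumptions of mine, not part of the original answer. It would replace the stub above, with import json and import sqlite3 added at the top of mybot/spidermanager.py:

    def _get_spider_info(self, name):
        # query the sqlite backend using `name` as primary key
        conn = sqlite3.connect('spiders.db')
        try:
            row = conn.execute(
                'SELECT start_urls, extra_domains, regexes '
                'FROM spiders WHERE name = ?', (name,)).fetchone()
        finally:
            conn.close()
        start_urls, extra_domains, regexes = [json.loads(col) for col in row]
        return (start_urls, extra_domains, regexes)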

and now your custom spider class, in mybot/spider.py:

from scrapy.spider import BaseSpider

class MyParametrizedSpider(BaseSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        self.regexes = regexes

    def parse(self, response):
        ...
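
A possible body for parse(), purely as a sketch (it assumes self.regexes holds plain pattern strings, that matching links should simply be followed, and it uses a deliberately naive href regex; add import re and from scrapy.http import Request at the top of mybot/spider.py):

    def parse(self, response):
        # follow only absolute links whose URL matches one of the
        # configured regexes (naive href extraction, for illustration)
        for url in re.findall(r'href="(https?://[^"]+)"', response.body):
            if any(re.search(regex, url) for regex in self.regexes):
                yield Request(url, callback=self.parse)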

Notes:

  • You can extend CrawlSpider too if you want to take advantage of its Rules system (see the sketch after these notes)
  • To run a spider use: ./scrapy-ctl.py crawl <name>, where name is passed to SpiderManager.fromdomain and is the key to retrieve more spider info from the backend system
  • As this solution overrides the default SpiderManager, coding a classic spider (a Python module per spider) doesn't work, but I think this is not an issue for you. More info on the default spider manager: TwistedPluginSpiderManager
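
For the first note, a parametrized CrawlSpider might look roughly like this -- a sketch against the same old Scrapy API used above, where the single follow-everything Rule built from the configured regexes is my assumption about what you want:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MyParametrizedCrawlSpider(CrawlSpider):

    def __init__(self, name, start_urls, extra_domain_names, regexes):
        self.domain_name = name
        self.start_urls = start_urls
        self.extra_domain_names = extra_domain_names
        # build the rules from the configured regexes before CrawlSpider
        # compiles them in its own __init__
        self.rules = [Rule(SgmlLinkExtractor(allow=regexes),
                           callback='parse_item', follow=True)]
        super(MyParametrizedCrawlSpider, self).__init__()

    def parse_item(self, response):
        # extract your items here
        pass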
dangra
The difference from Alex Martelli's approach is that spiders are instantiated on demand, instead of pre-instantiating all of them just to use one. This approach can reduce the load on your backend and the memory footprint of your Scrapy bot process.
dangra