spider

Twisted errors in Scrapy spider

Hello, when I run the spider from the Scrapy tutorial I get these error messages:

    File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent
      DeferredList(beforeResults).addCallback(self._continueFiring)
    File "C:\Python26\lib\site-packages\twisted\internet\defer.py", line 195, in addCallback
      callbackKeyword...

Newbie Q about Scrapy pipeline.py

Hello, I am studying the Scrapy tutorial. To test the process I created a new project with these files (see my post in the Scrapy group for links to the scripts; I cannot post more than one link here). The spider runs well, scrapes the text between the title tags, and puts it in FirmItem:

    [whitecase.com] INFO: Passed FirmItem(title=[u'White & ...
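
For anyone following along, a pipeline is just a class with a process_item hook; a minimal sketch (FirmItem comes from the question above, and the exact process_item signature has varied across Scrapy versions, so check the one you are running):

    # pipelines.py -- minimal sketch; the signature shown is the modern one
    # (older Scrapy versions pass the arguments differently)
    class FirmPipeline(object):
        def process_item(self, item, spider):
            # tidy the scraped titles before they are stored or exported
            item['title'] = [t.strip() for t in item['title']]
            return item  # returning the item hands it to the next pipeline

It only runs if registered under ITEM_PIPELINES in settings.py (a list of dotted paths in old versions, a dict with priorities in newer ones).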

Designing a multi-process spider in Python

I'm working on a multi-process spider in Python. It should start by scraping one page for links and work from there. Specifically, the top-level page contains a list of categories, the second-level pages list events in those categories, and the final, third-level pages list participants in the events. I can't predict how many categories, events or pa...
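
One rough shape for this, sticking to the standard library: fan out one level at a time with a process pool, discovering the sizes as you go. A sketch only; the URL is a placeholder and the crude regex link extraction stands in for real parsing:

    import re
    from multiprocessing import Pool
    from urllib.request import urlopen

    LINK_RE = re.compile(rb'href="([^"]+)"')

    def links(url):
        # crude absolute-href extraction; swap in a real HTML parser later
        return [m.decode() for m in LINK_RE.findall(urlopen(url).read())]

    if __name__ == '__main__':
        categories = links('http://example.com/categories')  # placeholder
        with Pool(processes=8) as pool:
            # level 2: events per category, discovered as we go
            events = [e for lst in pool.map(links, categories) for e in lst]
            # level 3: participant pages per event
            participants = pool.map(links, events)

Processes are only one option here; since the work is network-bound rather than CPU-bound, threads would do just as well with less overhead.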

Ruby web spider & search engine library

I'm looking for a Ruby library or gem (or set of gems) which will not only do spidering, but also collect the data into, say, a database, and allow basic searches on the data (i.e. a typical web search). I've found several spidering libraries, so that part seems well covered (I was going to try Anemone first), but I can't find anything ...

Scrapy spider index error

This is the code for Spyder1 that I've been trying to write within the Scrapy framework:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item
    from firm.items import FirmItem

    class Spider1(CrawlSpider):...
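
The excerpt stops before the error itself, but a common source of an IndexError in a spider like this is indexing into an empty XPath result. A defensive callback, sketched in the same old scrapy.contrib API as the question:

    # inside Spider1 -- hxs.select() can return an empty list on pages
    # that lack the element, so check before indexing
    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = FirmItem()
        titles = hxs.select('//title/text()').extract()
        if titles:                 # avoids IndexError on non-matching pages
            item['title'] = titles[0]
        return item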

How to write a simple spider in Python?

Hello, I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python?

1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A
2) From the initial url, pick up these urls with this regex: hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+') [u'/cabel', u'/jac...
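
Outside of Scrapy, the first two steps could look something like this sketch (lxml used for the XPath; the XPath and regex are the ones from the question):

    import re
    from urllib.request import urlopen
    from lxml import html

    initial_url = 'http://www.whitecase.com/Attorneys/List.aspx?LastName=A'
    tree = html.fromstring(urlopen(initial_url).read())
    # step 2: same XPath and regex as in the question
    hrefs = tree.xpath('//td[@class="altRow"][1]/a/@href')
    pattern = re.compile(r'/.a\w+')
    profile_paths = [m.group(0) for m in map(pattern.search, hrefs) if m]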

Scrapy BaseSpider: How does it work?

This is the BaseSpider example from the Scrapy tutorial:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from dmoz.items import DmozItem

    class DmozSpider(BaseSpider):
        domain_name = "dmoz.org"
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http...
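
The flow, roughly: Scrapy turns each entry in start_urls into a Request, downloads it, and calls the spider's parse() with each response. The tutorial's callback (old 0.x API, matching the excerpt) continues along these lines:

    # sketch of the tutorial's parse() -- callbacks return items and/or
    # further Requests; here each <li> becomes one DmozItem
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for site in hxs.select('//ul/li'):
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            items.append(item)
        return items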

Scrapy SgmlLinkExtractor question

Hello, I am trying to make the SgmlLinkExtractor work. This is the signature:

    SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(),
                      restrict_xpaths=(), tags=('a', 'area'), attrs=('href'),
                      canonicalize=True, unique=True, process_value=None)

I am just using allow=(). So, I enter rules = (Rule(SgmlLinkExtractor(all...
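
For what it's worth, an empty allow=() places no restriction at all, so every link found via the tags/attrs settings gets extracted. Restricting it means handing allow a regex (or tuple of regexes) matched against the absolute URL; a sketch in the same old API, with an illustrative pattern:

    from scrapy.contrib.spiders import Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    rules = (
        # only URLs matching the (illustrative) pattern are followed
        Rule(SgmlLinkExtractor(allow=(r'/attorneys/\w+', )),
             callback='parse_item', follow=True),
    )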

Scrapy make_requests_from_url(url)

Hello, in the Scrapy tutorial there is this method of the BaseSpider:

    make_requests_from_url(url)

A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to conv...
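
The stock implementation is essentially a one-liner, which also shows why you would override it: to attach your own callback, headers, or meta to those initial requests. A sketch (parse_profile is a hypothetical callback of your own):

    from scrapy.http import Request

    # the default is essentially Request(url, dont_filter=True); overriding
    # lets you route the initial responses to a callback of your choosing
    def make_requests_from_url(self, url):
        return Request(url, callback=self.parse_profile, dont_filter=True)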

A simple spider question

Hello spider experts: I am a newbie trying to achieve this simple task using Scrapy, with no luck so far. I am asking your advice about how to do this with Scrapy or with any other Python tool. Thank you. I want to start from a page that lists bios of attorneys whose last names start with A: initial_url = www.example.com/Attor...

Scrapy SgmlLinkExtractor is ignoring allowed links

Please take a look at this spider example in the Scrapy documentation. The explanation is: This spider would start crawling example.com’s home page, collecting category links and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be fi...
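
Reduced to its shape, the documented example pairs one rule that only follows category pages with one that routes item pages to a callback (old 0.x API, illustrative patterns):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector

    class ExampleSpider(CrawlSpider):
        domain_name = 'example.com'
        start_urls = ['http://www.example.com']

        rules = (
            # no callback: these links are only followed
            Rule(SgmlLinkExtractor(allow=(r'category\.php', ))),
            # item pages go to parse_item
            Rule(SgmlLinkExtractor(allow=(r'item\.php', )),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            # extract fields with XPath and return populated items here
            return []

One classic way to make a CrawlSpider seem to ignore its allowed links, incidentally, is naming your callback parse: CrawlSpider reserves that method for its own link routing.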

Automated spider test

Hi all. I'm looking to add a very simple layer of automated integration testing to our current Continuous Integration setup (CI currently only checks for build breaks). Is there a product that will:

- from a base URL, spider a site and report back any 404/500 error codes?
- allow me to add a logon step, to be able to spider the authori...
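
If nothing off the shelf fits, even a small script can cover the first bullet; a sketch of a same-host link checker that a CI step could run (the logon bullet would additionally need a primed cookie jar or session):

    import re
    from collections import deque
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from urllib.parse import urljoin, urlparse

    def check_site(base_url):
        # breadth-first crawl within one host, collecting URLs that fail
        host = urlparse(base_url).netloc
        seen, queue, failures = set(), deque([base_url]), []
        while queue:
            url = queue.popleft()
            if url in seen or urlparse(url).netloc != host:
                continue
            seen.add(url)
            try:
                body = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
            except (HTTPError, URLError) as err:
                failures.append((url, err))   # 404/500/... end up here
                continue
            for href in re.findall(r'href="([^"]+)"', body):
                queue.append(urljoin(url, href))
        return failures  # non-empty -> fail the build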

Writing a Faster Python Spider

Hello, I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help optimizing it for speed. What I need to do is examine the pages for a certain number and, if it is found, record the link to the page. The spider is very simple; it just needs to sort thro...
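
Since the work is network-bound rather than CPU-bound, the biggest single win is usually overlapping the requests; a hedged sketch with a standard-library thread pool (TARGET stands in for "a certain number"):

    import concurrent.futures
    from urllib.request import urlopen

    TARGET = '555-1234'   # placeholder for the number being searched for

    def has_target(url):
        try:
            page = urlopen(url, timeout=10).read().decode('utf-8', 'replace')
        except OSError:
            return None    # unreachable pages are simply skipped
        return url if TARGET in page else None

    def scan(urls, workers=50):
        # overlap the slow network waits across many threads
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as ex:
            return [hit for hit in ex.map(has_target, urls) if hit]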

Need access to a search engine's database

Obviously, I think it's overkill for me to run a spider that will crawl the internet autonomously like Google's or Yahoo's. So I am wondering: is there some way I can access a major search engine's database instead of scraping them? ...

Will <insert popular website here> restrict me from accessing their website if I request it too many times?

I ask this because I am creating a spider to collect data from blogger.com for a data visualisation project for university. The spider will look for about 17,000 values on Blogger's browse function and (anonymously) save certain ones if they fit the right criteria. I've been running the spider (written in PHP) and it works fine, b...
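
On the throttling side, a crawl like this survives longer if it paces itself and backs off when the server signals overload; a sketch (in Python rather than the question's PHP; the delay values are illustrative, and a numeric Retry-After header is assumed):

    import time
    from urllib.request import urlopen
    from urllib.error import HTTPError

    def polite_fetch(urls, delay=2.0):
        for url in urls:
            try:
                yield url, urlopen(url, timeout=10).read()
            except HTTPError as err:
                if err.code in (429, 503):   # server asking us to back off
                    time.sleep(float(err.headers.get('Retry-After', 60)))
                else:
                    yield url, None
            time.sleep(delay)                # steady, crawler-friendly pace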

Why should Ruby not be used to create a spider?

In Episode 78 of the Joel & Jeff podcast, one of the Doctype / Litmus guys states that you would never want to build a spider in Ruby. Would anyone like to guess at his reasoning for this? ...

Legality, terms of service for performing a web crawl

I was going to crawl a site for some research I was collecting. But, apparently, the terms of service are quite clear on the topic. Is it illegal to not follow the terms of service? And what can the site normally do? Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause? Restriction...

Best Open Source Spider for Site Coverage.

I am interested in crawling a lot of websites. The most important consideration is that the spider is able to reach as much of the site as possible. One key feature that is lacking from most spiders is the ability to execute JavaScript, which is required in order to crawl AJAX-powered sites. I really like open source and I will need t...
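
For the JavaScript requirement specifically, one common workaround is to drive a real browser and read the DOM after scripts have run; a minimal Selenium sketch (the URL is a placeholder, and a webdriver must be installed):

    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get('http://example.com')    # placeholder URL
        rendered = driver.page_source       # DOM after JavaScript executed
    finally:
        driver.quit()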

Slow down spidering of website

Is there a way to force a spider to slow down its spidering of a website? Anything that can be put in headers or robots.txt? I thought I remembered reading something about this being possible but cannot find anything now. ...
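
The robots.txt directive you are probably thinking of is Crawl-delay, a non-standard extension honored by some major crawlers (Yahoo, Bing, Yandex) but ignored by Google, which instead offers a crawl-rate setting in its webmaster console:

    User-agent: *
    Crawl-delay: 10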

Allow SE indexing on index.html only.

What would be the shortest method to block everything and allow only the major search engines to index just the site's index page?

    User-agent: *
    Disallow: /

    User-agent: Googlebot
    Disallow: /
    Allow: index.html

    User-agent: Slurp
    Disallow: /
    Allow: index.html

    User-agent: msn
    Disallow: /
    Allow: index.html

Would this work? ...
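
Probably not quite as written: Allow is itself a non-standard extension (though the big three honor it), path values need a leading slash, and MSN's crawler identifies as msnbot. A closer-to-working sketch:

    User-agent: Googlebot
    Allow: /index.html
    Disallow: /

    User-agent: Slurp
    Allow: /index.html
    Disallow: /

    User-agent: msnbot
    Allow: /index.html
    Disallow: /

    User-agent: *
    Disallow: /

Each crawler obeys only the most specific User-agent group that matches it, so the named bots never read the catch-all block.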