Hello,
When I run the spider from the Scrapy tutorial I get these error messages:
File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent DeferredList(beforeResults).addCallback(self._continueFiring)
File "C:\Python26\lib\site-packages\twisted\internet\defer.py", line 195, in addCallback callbackKeyword...
Hello,
I am studying the Scrapy tutorial. To test the process I created a new project with these files:
See my post in the Scrapy group for links to the scripts; I cannot post more than one link here.
The spider runs well; it scrapes the text between the title tags and puts it in FirmItem:
[whitecase.com] INFO: Passed FirmItem(title=[u'White & ...
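For context, the whole spider can be as small as the sketch below. This is only my reconstruction: the FirmItem 'title' field is inferred from the log line above, and the class name and start URL are placeholders.

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from firm.items import FirmItem        # assumed: FirmItem declares a 'title' Field

    class FirmSpider(BaseSpider):
        domain_name = "whitecase.com"
        start_urls = ["http://www.whitecase.com/"]

        def parse(self, response):
            # grab the text between the <title> tags and store it on the item
            hxs = HtmlXPathSelector(response)
            item = FirmItem()
            item['title'] = hxs.select('//title/text()').extract()
            return [item]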
I'm working on a multi-process spider in Python. It should start by scraping one page for links and work from there. Specifically, the top-level page contains a list of categories, the second-level pages list the events in those categories, and the final, third-level pages list the participants in the events. I can't predict how many categories, events or pa...
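Staying with Scrapy (since the rest of this thread uses it), the usual pattern for an unknown number of categories, events and participants is one callback per level, each yielding Requests for the next level; Scrapy's engine then handles the concurrent downloads, which often makes hand-rolled multi-processing unnecessary. Everything below (URLs, XPath expressions, the item field) is a placeholder rather than code from a real site.

    from urlparse import urljoin                      # Python 2; urllib.parse.urljoin on Python 3
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from scrapy.item import Item, Field

    class ParticipantItem(Item):
        name = Field()

    class ThreeLevelSpider(BaseSpider):
        domain_name = "example.com"                   # placeholder
        start_urls = ["http://www.example.com/categories"]

        def parse(self, response):
            # level 1: the top-level page lists the categories
            hxs = HtmlXPathSelector(response)
            for href in hxs.select('//a[@class="category"]/@href').extract():
                yield Request(urljoin(response.url, href), callback=self.parse_category)

        def parse_category(self, response):
            # level 2: each category page lists its events
            hxs = HtmlXPathSelector(response)
            for href in hxs.select('//a[@class="event"]/@href').extract():
                yield Request(urljoin(response.url, href), callback=self.parse_event)

        def parse_event(self, response):
            # level 3: each event page lists its participants
            hxs = HtmlXPathSelector(response)
            for name in hxs.select('//td[@class="participant"]/text()').extract():
                item = ParticipantItem()
                item['name'] = name
                yield item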
I'm looking for a Ruby library or gem (or set of gems) which will not only do spidering, but also collect the data into, say, a database, and allow basic searches on the data (i.e. a typical web search).
I've found several spidering libraries, so that part seems well covered (I was going to try Anemone first), but I can't find anything ...
This is the code for Spyder1 that I've been trying to write within the Scrapy framework:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem
class Spider1(CrawlSpider):...
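The class body is cut off above; purely as a sketch of how it might continue, reusing the imports already shown (the allow pattern, the XPath and the callback body are my guesses, not the original code):

    class Spider1(CrawlSpider):
        domain_name = "whitecase.com"
        start_urls = ["http://www.whitecase.com/Attorneys/List.aspx?LastName=A"]
        rules = (
            # follow the attorney profile links and hand each page to parse_item
            Rule(SgmlLinkExtractor(allow=(r'/attorneys/', )), callback='parse_item'),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)
            item = FirmItem()
            item['title'] = hxs.select('//title/text()').extract()
            return [item]

Depending on the Scrapy version, a module-level SPIDER = Spider1() line may also be required.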
Hello,
I've been trying to write this spider for weeks, but without success. What is the best way for me to code this in Python:
1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A
2) From the initial url, pick up these urls with this selector and regex:
hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jac...
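For what it's worth, the usual next step after that selector is to turn the relative paths into absolute URLs and yield a Request for each one; a rough sketch, where the bio callback name (parse_bio) is my own placeholder:

    from urlparse import urljoin                      # Python 2; urllib.parse.urljoin on Python 3
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    class AttorneySpider(BaseSpider):
        domain_name = "whitecase.com"
        start_urls = ["http://www.whitecase.com/Attorneys/List.aspx?LastName=A"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # same selector and regex as above, yielding e.g. u'/cabel'
            for href in hxs.select('//td[@class="altRow"][1]/a/@href').re(r'/.a\w+'):
                yield Request(urljoin(response.url, href), callback=self.parse_bio)

        def parse_bio(self, response):
            # placeholder: scrape the individual attorney bio page here
            pass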
This is the BaseSpider example from the Scrapy tutorial:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem
class DmozSpider(BaseSpider):
    domain_name = "dmoz.org"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http...
Hello,
I am trying to make SgmlLinkExtractor work.
This is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
I am just using allow=()
So, I enter
rules = (Rule(SgmlLinkExtractor(all...
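Completed with an actual pattern, the rules tuple would look roughly like this (the pattern itself is only an example):

    rules = (
        # follow links whose URL matches the allow pattern and pass each
        # matching page to the parse_item callback
        Rule(SgmlLinkExtractor(allow=(r'/Attorneys/', )), callback='parse_item', follow=True),
    )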
Hello,
In the Scrapy tutorial there is this method of the BaseSpider:
make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to conv...
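In other words, you normally only override it when the default GET request for each start URL is not enough; a small sketch of such an override (the callback name and the dont_filter choice are mine):

    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class MySpider(BaseSpider):
        domain_name = "example.com"
        start_urls = ["http://www.example.com/"]

        def make_requests_from_url(self, url):
            # called once per start URL; return the Request(s) to schedule
            return Request(url, callback=self.parse_start, dont_filter=True)

        def parse_start(self, response):
            # handle the initial pages here
            pass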
Hello spider experts:
I am a newbie trying to achieve this simple task using Scrapy, with no luck so far. I am asking your advice on how to do this with Scrapy or with any other tool (in Python). Thank you.
I want to start from a page that lists bios of attorneys whose last names start with A: initial_url = www.example.com/Attor...
Please take a look at this spider example in the Scrapy documentation. The explanation is:
This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be fi...
Hi all.
I'm looking to add a very simple layer of automated integration testing to our current Continuous Integration setup. (CI currently only checks for build breaks).
Is there a product that will:
From a base URL, spider a site &
report back any 404/500 error codes?
Allow me to add a step to logon, to
be able to spider the authori...
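In case nothing off the shelf fits, the first requirement (spider from a base URL and report 404/500s) is small enough to sketch with the standard library. This is written against Python 2 to match the rest of the thread, and it ignores the login step, robots.txt, timeouts and content types:

    import urllib2
    import urlparse
    from HTMLParser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects the href of every <a> tag fed to it."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    def check_site(base_url):
        seen, queue = set([base_url]), [base_url]
        while queue:
            url = queue.pop(0)
            try:
                body = urllib2.urlopen(url).read()
            except urllib2.HTTPError, e:
                print url, e.code               # 404/500 and friends end up here
                continue
            collector = LinkCollector()
            try:
                collector.feed(body)
            except Exception:
                continue                        # skip pages the parser chokes on
            for link in collector.links:
                absolute = urlparse.urljoin(url, link)
                if absolute.startswith(base_url) and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)

    check_site("http://www.example.com/")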
Hello, I'm writing a spider in Python to crawl a site. Trouble is, I need to examine about 2.5 million pages, so I could really use some help optimizing it for speed.
What I need to do is examine the pages for a certain number and, if it is found, record the link to the page. The spider is very simple; it just needs to sort thro...
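A first step that usually helps a lot at that scale is fetching pages concurrently instead of one at a time; here is a worker-pool sketch using only the Python 2 standard library (the thread count, the number being searched for, and the URL list are all placeholders):

    import re
    import urllib2
    import threading
    from Queue import Queue

    TARGET = re.compile(r'\b1234567\b')     # placeholder: the number being searched for
    NUM_WORKERS = 20                        # tune against bandwidth and politeness limits

    def worker(queue, matches):
        while True:
            url = queue.get()
            try:
                body = urllib2.urlopen(url, timeout=30).read()
                if TARGET.search(body):
                    matches.append(url)     # list.append is atomic in CPython
            except Exception:
                pass                        # a real spider would log and retry here
            queue.task_done()

    def run(urls):
        queue, matches = Queue(), []
        for _ in range(NUM_WORKERS):
            t = threading.Thread(target=worker, args=(queue, matches))
            t.setDaemon(True)
            t.start()
        for url in urls:
            queue.put(url)
        queue.join()                        # block until every queued URL is processed
        return matches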
Obviously, I think it's overkill for me to run a spider that will crawl the internet autonomously like Google's or Yahoo's.
So I am wondering: is there some way I can access a major search engine's database instead of scraping them?
...
I ask this because I am creating a spider to collect data from blogger.com for a data visualisation project for university.
The spider will look for about 17,000 values on the browse function of blogger and (anonymously) save certain ones if they fit the right criteria.
I've been running the spider (written in PHP) and it works fine, b...
In Episode 78 of the Joel & Jeff podcast, one of the Doctype / Litmus guys states that you would never want to build a spider in Ruby. Would anyone like to guess at his reasoning for this?
...
I was going to crawl a site for some research I was collecting. But apparently the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service? And what can the site normally do?
Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause?
Restriction...
I am interested in crawling a lot of websites. The most important consideration is that the spider is able to reach as much of the site as possible. One key feature that is lacking from most spiders is the ability to execute JavaScript. This is required in order to crawl AJAX-powered sites. I really like open source and I will need t...
Is there a way to force a spider to slow down its spidering of a website? Anything that can be put in headers or robots.txt?
I thought I remembered reading something about this being possible but cannot find anything now.
...
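The robots.txt directive usually meant here is Crawl-delay. It is a non-standard extension, so only some spiders honour it (Yahoo and Bing have, Google has not), and it is a request rather than an enforcement:

    User-agent: *
    Crawl-delay: 10

The value is usually interpreted as the number of seconds the crawler is asked to wait between requests.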
What would be the shortest method to block everything (*) and allow only the major search engines to index just the index page of the site?
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /
Allow: index.html
User-agent: Slurp
Disallow: /
Allow: index.html
User-agent: msn
Disallow: /
Allow: index.html
Would this work?
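Two details worth double-checking before relying on that file: paths in robots.txt should start with a slash (Allow: /index.html rather than Allow: index.html), and a crawler obeys only the single most specific User-agent record that matches it, not the file read in order, so blank lines between records keep the intent clear (the MSN crawler also identifies itself as msnbot rather than msn). Something along these lines:

    User-agent: Googlebot
    Allow: /index.html
    Disallow: /

    User-agent: Slurp
    Allow: /index.html
    Disallow: /

    User-agent: msnbot
    Allow: /index.html
    Disallow: /

    User-agent: *
    Disallow: /

Allow is itself an extension to the original standard, but the major engines support it.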
...