Scrapy

Scrapy unknown scheduler middleware recursion problem

Dear everyone, I am using Scrapy for scraping. I decided to write my own scheduler middleware to store some requests and reduce the memory they take up. Here is my code:

    def enqueue_request_into_scheduler(self, spider, request):
        print "ENQUEUE SCHEDULER with request %s" % str(request)
        scrapyengine.scheduler.enqueue_reques...
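The recursion typically comes from re-entering the scheduler through the engine, which runs the request back through the same middleware chain. Below is a minimal sketch of one way to break the cycle, assuming the old scheduler-middleware interface where enqueue_request(self, spider, request) is called for every scheduled request; the class name and the meta key are illustrative, not Scrapy API:

    class HoldBackMiddleware(object):
        """Hypothetical scheduler middleware that parks some requests
        outside the in-memory queue and re-injects them later."""

        def __init__(self):
            self.held = []  # requests parked to keep the queue small

        def enqueue_request(self, spider, request):
            # Let through requests we re-injected ourselves; otherwise
            # this middleware intercepts them again and recurses forever.
            if request.meta.get('reinjected'):
                return request
            self.held.append(request)
            return None  # park the request for now

        def next_held_request(self):
            # Flag the request before handing it back to the engine so the
            # guard above passes it through exactly once.
            request = self.held.pop()
            request.meta['reinjected'] = True
            return request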

Most efficient way to store crawler state?

Hi there, I'm currently writing a web crawler (using the Python framework Scrapy). Recently I had to implement a pause/resume system. The solution I implemented is the simplest kind: basically, it stores links when they get scheduled and marks them as 'processed' once they actually are. Thus, I'm able to fetch those links (obviousl...
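If the stored links no longer fit comfortably in memory, one simple approach is to keep the scheduled/processed state in a small on-disk store instead. A sketch using sqlite3; the table layout is invented for illustration:

    import sqlite3

    class CrawlState(object):
        """Minimal persistent record of scheduled vs. processed links."""

        def __init__(self, path='crawl_state.db'):
            self.db = sqlite3.connect(path)
            self.db.execute(
                'CREATE TABLE IF NOT EXISTS links '
                '(url TEXT PRIMARY KEY, processed INTEGER DEFAULT 0)')

        def schedule(self, url):
            # INSERT OR IGNORE keeps re-scheduled links from duplicating.
            self.db.execute(
                'INSERT OR IGNORE INTO links (url) VALUES (?)', (url,))
            self.db.commit()

        def mark_processed(self, url):
            self.db.execute(
                'UPDATE links SET processed = 1 WHERE url = ?', (url,))
            self.db.commit()

        def pending(self):
            # Scheduled but not yet processed: the set needed on resume.
            return [row[0] for row in self.db.execute(
                'SELECT url FROM links WHERE processed = 0')]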

Twisted errors in Scrapy spider

Hello, when I run the spider from the Scrapy tutorial I get these error messages:

    File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent
      DeferredList(beforeResults).addCallback(self._continueFiring)
    File "C:\Python26\lib\site-packages\twisted\internet\defer.py", line 195, in addCallback
      callbackKeyword...

Newbie Q about Scrapy pipeline.py

Hello, I am studying the Scrapy tutorial. To test the process I created a new project with these files (see my post in the Scrapy group for links to the scripts; I cannot post more than one link here). The spider runs well: it scrapes the text between the title tags and puts it in FirmItem:

    [whitecase.com] INFO: Passed FirmItem(title=[u'White & ...
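For reference, here is a minimal item pipeline sketch that writes each scraped title to a file. The process_item(self, item, spider) signature and the open_spider/close_spider hooks are those of current Scrapy; very old releases ordered the arguments differently, so check the version you are running:

    class FirmItemPipeline(object):
        """Writes each scraped title to a text file, one per line."""

        def open_spider(self, spider):
            self.out = open('titles.txt', 'w')

        def process_item(self, item, spider):
            # title is a list of unicode strings, as in the log line above.
            self.out.write(item['title'][0].encode('utf-8') + '\n')
            return item  # always return the item for later pipelines

        def close_spider(self, spider):
            self.out.close()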

Scrapy domain_name for spider

From the Scrapy tutorial: "domain_name: identifies the Spider. It must be unique, that is, you can’t set the same domain name for different Spiders." Does this mean that domain_name must be a valid domain name, like

    domain_name = 'example.com'

or can I name it something like

    domain_name = 'ex1'

The problem is I had a spider that worked with...
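For reference, this is how the attribute appeared under the old BaseSpider API; later Scrapy releases replaced domain_name with a name attribute plus an allowed_domains list:

    from scrapy.spider import BaseSpider

    class ExampleSpider(BaseSpider):
        # domain_name identifies the spider and must be unique across
        # the project's spiders, per the tutorial text quoted above.
        domain_name = 'example.com'
        start_urls = ['http://www.example.com/']

        def parse(self, response):
            pass  # extraction logic goes here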

Scrapy spider index error

This is the code for Spyder1 that I've been trying to write within the Scrapy framework:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item
    from firm.items import FirmItem

    class Spider1(CrawlSpider):...

How to write a simple spider in Python?

Hello, I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python:

1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A
2) From the initial url, pick up these urls with this regex:

    hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
    [u'/cabel', u'/jac...
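Here is a sketch of those two steps with the old BaseSpider API. The start URL and the XPath/regex come from the question; the class name and the follow-up callback are illustrative:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request

    class AttorneySpider(BaseSpider):
        domain_name = 'whitecase.com'
        start_urls = ['http://www.whitecase.com/Attorneys/List.aspx?LastName=A']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # Step 2: pick up the relative attorney links and follow each.
            for href in hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+'):
                yield Request('http://www.whitecase.com' + href,
                              callback=self.parse_attorney)

        def parse_attorney(self, response):
            pass  # per-attorney extraction would go here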

Scrapy BaseSpider: How does it work?

This is the BaseSpider example from the Scrapy tutorial:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from dmoz.items import DmozItem

    class DmozSpider(BaseSpider):
        domain_name = "dmoz.org"
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http...

Scrapy spider is not working

Since nothing so far is working, I started a new project with

    python scrapy-ctl.py startproject Nu

I followed the tutorial exactly, created the folders, and added a new spider:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelect...

Scrapy SgmlLinkExtractor question

Hello, I am trying to make the SgmlLinkExtractor work. This is the signature:

    SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(),
                      restrict_xpaths=(), tags=('a', 'area'), attrs=('href',),
                      canonicalize=True, unique=True, process_value=None)

I am just using allow=(). So, I enter

    rules = (Rule(SgmlLinkExtractor(all...
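For comparison, here is a minimal CrawlSpider that uses only allow, as the question describes; the regex and the names are illustrative:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class MySpider(CrawlSpider):
        domain_name = 'example.com'
        start_urls = ['http://www.example.com/']

        rules = (
            # allow takes regexes matched against the absolute URL, so
            # anchor patterns loosely rather than to the full string.
            Rule(SgmlLinkExtractor(allow=(r'/category/\d+',)),
                 callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            pass  # item extraction goes here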

Scrapy make_requests_from_url(url)

Hello, in the Scrapy tutorial there is this method of the BaseSpider:

    make_requests_from_url(url)

"A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to conv...
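A sketch of the usual override, which customizes the initial requests built from start_urls; the meta key here is invented for illustration:

    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class MySpider(BaseSpider):
        domain_name = 'example.com'
        start_urls = ['http://www.example.com/a', 'http://www.example.com/b']

        def make_requests_from_url(self, url):
            # Called once per start URL by the default start_requests();
            # override it to shape the initial Request objects.
            return Request(url, callback=self.parse, meta={'seed': url})

        def parse(self, response):
            print "seed was %s" % response.request.meta['seed']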

Scrapy SgmlLinkExtractor is ignoring allowed links

Please take a look at this spider example in the Scrapy documentation. The explanation is: "This spider would start crawling example.com’s home page, collecting category links and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be fi...

How to remove expired items from database with Scrapy

I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items. Strategies to detect whether an item is expired:

- Spider the site's "delete.rss".
- Every few days, try reloading the contents page and making sure it still works.
- Spider every ...
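One common pattern, whichever detection strategy is used: stamp each stored item with the time of the crawl that last saw it, then purge rows that were not refreshed. A pipeline sketch; the sqlite table and field names are invented:

    import sqlite3
    import time

    class FreshnessPipeline(object):
        """Timestamps each item on every crawl; rows not re-seen within
        MAX_AGE are assumed expired and deleted."""

        MAX_AGE = 3 * 24 * 3600  # three days, in seconds

        def open_spider(self, spider):
            self.db = sqlite3.connect('videos.db')

        def process_item(self, item, spider):
            self.db.execute(
                'UPDATE videos SET last_seen = ? WHERE url = ?',
                (time.time(), item['url']))
            self.db.commit()
            return item

        def close_spider(self, spider):
            # Purge everything this crawl did not touch recently.
            self.db.execute('DELETE FROM videos WHERE last_seen < ?',
                            (time.time() - self.MAX_AGE,))
            self.db.commit()
            self.db.close()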

Scrape HTML generated by JavaScript with Python

I need to scrape a site with Python. I obtain the source html code with the urllib module, but I also need to scrape some html code that is generated by a javascript function (which is included in the html source). What this function does "in" the site is that when you press a button it outputs some html code. How can I "press" this butt...
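Since urllib cannot run JavaScript, the usual workaround is to find the request the button fires (with a tool such as LiveHTTPHeaders or Firebug) and replay it directly. A sketch with urllib2; the URL and parameters are placeholders for whatever the captured request contains:

    import urllib
    import urllib2

    # Replay the request the button's JavaScript would send.
    data = urllib.urlencode({'param': 'value'})  # fields the function posts
    response = urllib2.urlopen('http://example.com/endpoint', data)
    generated_html = response.read()  # the HTML the button would have produced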

How to build a web crawler based on Scrapy to run forever?

I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:

- Run forever: it will periodically re-visit some portal pages to get updates.
- Schedule priorities: give different priorities to different types of URLs.
- Fetch with multiple threads.

I've read the Scrapy docum...
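On the priority point: Scrapy requests accept a priority argument that the scheduler honors, and fetching is already concurrent because Scrapy runs on Twisted's event loop rather than on threads. A sketch of assigning priorities per URL type; the values and the URL test are arbitrary:

    from scrapy.http import Request

    def requests_for(urls):
        for url in urls:
            # Portal front pages are revisited first; article pages later.
            prio = 10 if url.endswith('/index.html') else 0
            # dont_filter lets the same URL be scheduled again on a later
            # pass, which a run-forever crawler needs.
            yield Request(url, priority=prio, dont_filter=True)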

Scrapy install: no acceptable C compiler found in $PATH

I am trying to install Scrapy on a Mac OS X 10.6.2 machine... When I try to build one of the dependent modules (libxml2) I am getting the following error:

    configure: error: no acceptable C compiler found in $PATH

I assume I need gcc... is that easy to install on 10.6? Is there some sort of package I should be installing, so ...

Error installing Scrapy on Mac OS X 10.6

Trying to install Scrapy on Mac OS X 10.6 using this guide: When running these commands from Terminal:

    cd libxml2-2.7.3/python
    sudo make install

I get the following error:

    Making install in .
    make[1]: *** No rule to make target `../libxslt/libxslt.la', needed by `libxsltmod.la'. Stop.
    make: *** [install-recursive] Error 1

Followin...

Using one Scrapy spider for several websites

I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes: these will instead be configurable in a GUI. How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamical...
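One way, sketched under the old API: build the spider class at runtime from user-supplied configuration instead of hard-coding it; the config dict layout is invented:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    def make_spider(config):
        """Builds a CrawlSpider subclass from a GUI-edited config dict."""
        class ConfiguredSpider(CrawlSpider):
            domain_name = config['domain']
            start_urls = config['start_urls']
            rules = tuple(
                Rule(SgmlLinkExtractor(allow=(pattern,)),
                     callback='parse_item')
                for pattern in config['allow_patterns'])

            def parse_item(self, response):
                pass  # extraction shared by all configured spiders

        return ConfiguredSpider

    spider_cls = make_spider({'domain': 'example.com',
                              'start_urls': ['http://www.example.com/'],
                              'allow_patterns': [r'/items/\d+']})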

How to use CrawlSpider from Scrapy to click a link with a JavaScript onclick?

I want Scrapy to crawl pages where the link to the next page looks like this: Next. Will Scrapy be able to interpret the javascript code behind that? With the LiveHTTPHeaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this: encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n ...
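Scrapy does not execute JavaScript, so the usual route is the one the captured headers already suggest: scrape the hidden value from the current page and replay the POST with a FormRequest. A sketch of a spider callback; the endpoint URL and the input name are guesses for illustration:

    from scrapy.http import FormRequest
    from scrapy.selector import HtmlXPathSelector

    # a callback method inside a BaseSpider/CrawlSpider subclass
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Pull the session blob the onclick handler would have posted.
        blob = hxs.select(
            '//input[@name="encoded_session_hidden_map"]/@value').extract()[0]
        # Replay the POST that clicking "Next" triggers.
        yield FormRequest('http://example.com/results',
                          formdata={'encoded_session_hidden_map': blob},
                          callback=self.parse)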

Get document DOCTYPE with BeautifulSoup

I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned html document from the resulting soup object. Given the following html:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org...
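Assuming BeautifulSoup 3.x (the version contemporary with this question), the doctype is parsed into a Declaration node among the soup's top-level contents, so it can be filtered out by type:

    from BeautifulSoup import BeautifulSoup, Declaration

    html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
            '"http://www.w3.org/TR/html4/strict.dtd">'
            '<html><body></body></html>')
    soup = BeautifulSoup(html)

    # Declaration is a NavigableString subclass holding the doctype text.
    doctypes = [node for node in soup.contents
                if isinstance(node, Declaration)]
    if doctypes:
        print doctypes[0]  # DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ...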