Dear everyone,
I am using Scrapy for scraping.
I decided to write my own scheduler middleware to store some requests and reduce the amount of memory they take up.
Here is my code:
def enqueue_request_into_scheduler(self, spider, request):
    print "ENQUEUE SCHEDULER with request %s" % str(request)
    scrapyengine.scheduler.enqueue_reques...
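(For illustration only: a minimal sketch of the general idea of keeping queued request data on disk instead of in memory, not tied to any particular Scrapy scheduler-middleware hook; the class and file names below are invented.)

import json
import os

class DiskRequestQueue(object):
    """Illustrative sketch: queued request data lives on disk as JSON lines."""

    def __init__(self, path="pending_requests.jsonl"):
        self.path = path

    def push(self, request):
        # Persist just enough to rebuild the request later
        # (URL and callback name here; extend as needed).
        record = {"url": request.url,
                  "callback": getattr(request.callback, "__name__", None)}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def pop_all(self):
        # Read everything back and clear the on-disk queue.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            records = [json.loads(line) for line in f if line.strip()]
        os.remove(self.path)
        return records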
Hi there,
I'm currently writing a web crawler (using the Python framework Scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind and, basically, stores links when they get scheduled, and marks them as 'processed' once they actually are.
Thus, I'm able to fetch those links (obviousl...
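(Not the poster's actual code; just a sketch of the bookkeeping described above: remember links as they are scheduled, mark them processed, and read back the leftovers when resuming. The file and method names are invented.)

import json

class LinkLedger(object):
    """Sketch of the pause/resume bookkeeping described above."""

    def __init__(self, path="links.json"):
        self.path = path
        self.scheduled = set()
        self.processed = set()

    def mark_scheduled(self, url):
        self.scheduled.add(url)

    def mark_processed(self, url):
        self.processed.add(url)

    def pending(self):
        # Links scheduled but never processed: re-crawl these on resume.
        return self.scheduled - self.processed

    def save(self):
        with open(self.path, "w") as f:
            json.dump({"scheduled": list(self.scheduled),
                       "processed": list(self.processed)}, f)

    def load(self):
        with open(self.path) as f:
            data = json.load(f)
        self.scheduled = set(data["scheduled"])
        self.processed = set(data["processed"])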
Hello,
When I run the spider from the Scrapy tutorial I get these error messages:
File "C:\Python26\lib\site-packages\twisted\internet\base.py", line 374, in fireEvent DeferredList(beforeResults).addCallback(self._continueFiring)
File "C:\Python26\lib\site-packages\twisted\internet\defer.py", line 195, in addCallback callbackKeyword...
Hello,
I am studying the Scrapy tutorial. To test the process I created a new project with these files:
See my post in Scrapy group for links to scripts, I cannot post more than 1 link here.
The spider runs well and scrapes the text between title tags and puts it in FirmItem
[whitecase.com] INFO: Passed FirmItem(title=[u'White & ...
From the Scrapy tutorial:
domain_name: identifies the Spider. It must be unique, that is, you can’t set the same domain name for different Spiders.
Does this mean that domain_name must be a valid domain name, like
domain_name = 'example.com'
Or can I use something like
domain_name = 'ex1'
The problem is I had a spider that worked with...
This is the code for Spider1 that I've been trying to write within the Scrapy framework:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from firm.items import FirmItem
class Spider1(CrawlSpider):...
Hello,
I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python:
1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A
2) from initial url pick up these urls with this regex:
hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')
[u'/cabel', u'/jac...
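(One possible shape for this, using the same old-style Scrapy API as the snippets above; the spider name and the parse_attorney callback are invented for the example.)

from urlparse import urljoin

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class AttorneySpider(BaseSpider):
    domain_name = "whitecase.com"
    start_urls = ["http://www.whitecase.com/Attorneys/List.aspx?LastName=A"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        requests = []
        # Pick up the relative profile links with the regex from step 2.
        for href in hxs.select('//td[@class="altRow"][1]/a/@href').re(r'/.a\w+'):
            # Turn u'/cabel' etc. into absolute URLs and follow them.
            requests.append(Request(urljoin(response.url, href),
                                    callback=self.parse_attorney))
        return requests

    def parse_attorney(self, response):
        # Per-attorney extraction would go here.
        pass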
This is the BaseSpider example from the Scrapy tutorial:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dmoz.items import DmozItem
class DmozSpider(BaseSpider):
    domain_name = "dmoz.org"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http...
Since nothing so far is working I started a new project with
python scrapy-ctl.py startproject Nu
I followed the tutorial exactly, created the folders, and added a new spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelect...
Hello,
I am trying to make SgmlLinkExtractor work.
This is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href'), canonicalize=True, unique=True, process_value=None)
I am just using allow=()
So, I enter
rules = (Rule(SgmlLinkExtractor(all...
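(For comparison, a common shape for such a rule, with a made-up allow pattern rather than the poster's actual one.)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    domain_name = "example.com"
    start_urls = ["http://www.example.com/"]

    rules = (
        # Follow links whose URL matches the allow pattern and hand
        # the downloaded responses to parse_item.
        Rule(SgmlLinkExtractor(allow=(r'/items/',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extraction logic goes here.
        pass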
Hello,
In the Scrapy tutorial there is this method of the BaseSpider:
make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. This method is used to construct the initial requests in the start_requests() method, and is typically used to conv...
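(Its default behaviour is, roughly, to wrap the URL in a Request. Below is a sketch of overriding it, for example to attach extra metadata to every initial request; the 'source' meta key is invented.)

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MySpider(BaseSpider):
    domain_name = "example.com"
    start_urls = ["http://www.example.com/"]

    def make_requests_from_url(self, url):
        # Same idea as the default implementation, plus metadata that
        # every initial request will carry.
        return Request(url, dont_filter=True, meta={'source': 'start_urls'})

    def parse(self, response):
        pass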
Please take a look at this spider example in the Scrapy documentation. The explanation is:
This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be fi...
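(The spider being described looks roughly like the sketch below; the item fields, XPaths, and URL patterns are placeholders, not the exact documentation code.)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class ExampleItem(Item):
    name = Field()
    description = Field()

class ExampleSpider(CrawlSpider):
    domain_name = "example.com"
    start_urls = ["http://www.example.com/"]

    rules = (
        # Category links: no callback, so they are just followed.
        Rule(SgmlLinkExtractor(allow=(r'/category/',))),
        # Item links: downloaded responses go to parse_item.
        Rule(SgmlLinkExtractor(allow=(r'/item/',)), callback='parse_item'),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = ExampleItem()
        item['name'] = hxs.select('//h1/text()').extract()
        item['description'] = hxs.select('//div[@id="description"]/text()').extract()
        return item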
I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items.
Strategies to detect if an item is expired are:
Spider the site's "delete.rss".
Every few days, try reloading the contents page and making sure it still works.
Spider every ...
I need to scrape a site with Python. I obtain the source HTML with the urllib module, but I also need to scrape some HTML that is generated by a JavaScript function (which is included in the HTML source). What this function does on the site is output some HTML code when you press a button. How can I "press" this butt...
I want to build a web crawler based on Scrapy to grab news pictures from several news portal websites. I want this crawler to:
- Run forever: it will periodically re-visit some portal pages to pick up updates.
- Schedule priorities: give different priorities to different types of URLs (see the sketch below).
- Fetch with multiple threads.
I've read the Scrapy docum...
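(On the priority point: assuming a Scrapy version where Request accepts a priority argument, portal front pages can be re-fetched with dont_filter and scheduled ahead of ordinary pages. A hedged sketch with made-up URLs and callbacks:)

from scrapy.spider import BaseSpider
from scrapy.http import Request

class PortalSpider(BaseSpider):
    domain_name = "example.com"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        requests = []
        # Portal front page: always re-fetch it (dont_filter) and schedule
        # it ahead of ordinary pages (higher priority is dequeued sooner).
        requests.append(Request("http://www.example.com/news/",
                                callback=self.parse_portal,
                                dont_filter=True, priority=10))
        # Ordinary article/picture pages at the default priority.
        requests.append(Request("http://www.example.com/article/1",
                                callback=self.parse_article))
        return requests

    def parse_portal(self, response):
        pass

    def parse_article(self, response):
        pass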
I am trying to install Scrapy on a Mac OS X 10.6.2 machine...
When I try to build one of the dependent modules (libxml2)
I am getting the following error:
configure: error: no acceptable C compiler found in $PATH
I assume I need gcc... is that easy to install on 10.6? Is there some sort of package I should be installing, so ...
Trying to install Scrapy on Mac OSX 10.6 using this guide:
When running these commands from Terminal:
cd libxml2-2.7.3/python
sudo make install
I get the following error:
Making install in .
make[1]: *** No rule to make target `../libxslt/libxslt.la', needed by `libxsltmod.la'. Stop.
make: *** [install-recursive] Error 1
Followin...
I need to create a user-configurable web spider/crawler, and I'm thinking about using Scrapy. But I can't hard-code the domains and allowed URL regexes -- these will instead be configurable in a GUI.
How do I (as simply as possible) create a spider or a set of spiders with Scrapy where the domains and allowed URL regexes are dynamical...
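(One common pattern, sketched here with invented parameter names: pass the domains and patterns in at construction time and build the rules before the base class compiles them. Exact constructor details vary between Scrapy versions.)

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ConfigurableSpider(CrawlSpider):
    """Sketch only: domains and URL patterns come from the GUI/config
    rather than being hard-coded on the class."""

    def __init__(self, name, domains, start_urls, allow_patterns):
        self.domain_name = name
        self.start_urls = start_urls
        # Build the rules before CrawlSpider.__init__ compiles them; the
        # link extractor takes the configurable domains and allow patterns.
        self.rules = (
            Rule(SgmlLinkExtractor(allow=allow_patterns, allow_domains=domains),
                 callback='parse_item', follow=True),
        )
        super(ConfigurableSpider, self).__init__()

    def parse_item(self, response):
        pass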
I want Scrapy to crawl pages where the link to the next page looks like this:
Next
Will Scrapy be able to interpret the JavaScript code behind that link?
With the livehttpheaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this: encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n
...
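(Scrapy does not execute the JavaScript itself, but if the POST that livehttpheaders shows can be reproduced, one option is to rebuild it by hand with FormRequest, assuming a Scrapy version that provides it. The form value below is a placeholder for whatever the browser actually sent.)

from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class NextPageSpider(BaseSpider):
    domain_name = "example.com"
    start_urls = ["http://www.example.com/list"]

    def parse(self, response):
        # ... extract items from the current page here ...
        # Then replay the observed POST by hand; the value must be the
        # real captured payload, not this placeholder.
        return [FormRequest(response.url,
                            formdata={'encoded_session_hidden_map': 'CAPTURED-VALUE-HERE'},
                            callback=self.parse)]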
I've just started tinkering with Scrapy in conjunction with BeautifulSoup, and I'm wondering if I'm missing something very obvious, but I can't seem to figure out how to get the doctype of a returned HTML document from the resulting soup object.
Given the following HTML:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org...
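(With BeautifulSoup 4, if that is an option, the doctype appears as a Doctype node among the soup's top-level contents, so something along these lines pulls it out; BeautifulSoup 3 represents it differently.)

from bs4 import BeautifulSoup
from bs4.element import Doctype

html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        '"http://www.w3.org/TR/html4/strict.dtd">'
        '<html><head><title>x</title></head><body></body></html>')

soup = BeautifulSoup(html, "html.parser")

# The doctype is a top-level node rather than a tag, so scan soup.contents.
doctype = next((node for node in soup.contents if isinstance(node, Doctype)), None)
print(doctype)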