scrapy

Problems installing libxml2 on Mac OS X

I'm trying to install libxml2 on my Mac (OS 10.6.4). I'm actually trying to just run a Scrapy script in Python, which has required me to install Twisted, Zope, and now libxml2. I've downloaded the latest version (2.7.7, from xmlsoft.org) and tried following these instructions here. To summarize, I tried this command (in the python sub...

Scrapy web scraper cannot crawl link

Hi, I'm very new to Scrapy. Here is my spider to crawl twistedweb. class TwistedWebSpider(BaseSpider): name = "twistedweb3" allowed_domains = ["twistedmatrix.com"] start_urls = [ "http://twistedmatrix.com/documents/current/web/howto/", ] rules = ( Rule(SgmlLinkExtractor(), 'parse', follow=True, ), ) def parse...

How to use Scrapy

Hello, I would like to know how I can start a crawler based on Scrapy. I installed the tool via apt-get install and I tried to run an example: /usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list directory.google.com /usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl I hacked the code from spiders/google_di...

How can I package a scrapy project using cxfreeze?

I have a Scrapy project that I would like to package for a customer using Windows, without having to manually install dependencies for them. I came across cxfreeze, but I'm not quite sure how it would work with a Scrapy project. I'm thinking I would make some sort of interface and run the scrapy crawler with 'from scrapy.cmd...

How to install libxml2 in virtualenv?

I have a virtualenv created with the --no-site-packages option, and I'm using Scrapy in it. Scrapy uses libxml2 via import libxml2. How can I install libxml2 in the virtualenv using pip or easy_install? ...

Scrapy - how to identify already scraped URLs

Hi, I'm using Scrapy to crawl a news website on a daily basis. How do I prevent Scrapy from scraping already scraped URLs? Also, is there any clear documentation or examples for SgmlLinkExtractor? -Avinash ...
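One possible way to think about skipping URLs seen on earlier days is sketched below. It is only an illustration in plain Python and does not use any Scrapy API: the names fingerprint, SeenUrlStore, filter_new and the file seen_urls.txt are all hypothetical, and the idea is simply to keep a set of URL hashes on disk between runs and drop any URL whose hash is already recorded.

```python
import hashlib
import os
import tempfile


def fingerprint(url):
    """Return a stable hex digest for a URL (hypothetical helper)."""
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()


class SeenUrlStore(object):
    """Hypothetical on-disk set of URL fingerprints, one per line."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        # Load fingerprints written by earlier runs, if the file exists.
        if os.path.exists(path):
            with open(path) as handle:
                self.seen = set(line.strip() for line in handle if line.strip())

    def is_new(self, url):
        return fingerprint(url) not in self.seen

    def mark(self, url):
        fp = fingerprint(url)
        if fp not in self.seen:
            self.seen.add(fp)
            # Append so the record survives until the next run.
            with open(self.path, "a") as handle:
                handle.write(fp + "\n")


def filter_new(urls, store):
    """Return only the URLs not seen before, recording them as seen."""
    fresh = []
    for url in urls:
        if store.is_new(url):
            store.mark(url)
            fresh.append(url)
    return fresh


# Simulate two daily runs sharing the same store file.
store_path = os.path.join(tempfile.mkdtemp(), "seen_urls.txt")
day_one = filter_new(
    ["http://example.com/a", "http://example.com/b"], SeenUrlStore(store_path)
)
day_two = filter_new(
    ["http://example.com/b", "http://example.com/c"], SeenUrlStore(store_path)
)
```

In a spider, a check like store.is_new(url) could be applied before yielding a request for that URL; where exactly to hook it in depends on the project and is left open here.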

How to use Python for a web service

Hi, I am really new to Python and have just played around with the Scrapy framework, which is used to crawl websites and extract data. My question is: how do I pass parameters to a Python script that is hosted somewhere online? E.g. I make the following request: mysite.net/rest/index.py Now I want to pass some parameters, similar to PHP, like *.php?i...

Passing arguments inside Scrapy spider through lambda callbacks

Hi, I have this short spider code: class TestSpider(CrawlSpider): name = "test" allowed_domains = ["google.com", "yahoo.com"] start_urls = [ "http://google.com" ] def parse2(self, response, i): print "page2, i: ", i # traceback.print_stack() def parse(self, response): for i ...
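The excerpt is cut off before the loop body, so the following is only a guess at the situation: assuming the loop creates a lambda callback that refers to the loop variable i, a common pitfall is that Python closures look up i when the callback runs, not when it is created. The sketch below shows that behaviour in plain Python, without Scrapy, along with two usual ways to bind the value early. The function parse2 here is a standalone stand-in for the method in the question, and the build_* helpers are hypothetical.

```python
from functools import partial


def parse2(response, i):
    """Stand-in for the spider's second-stage callback."""
    return "page2, i: %d" % i


def build_late_bound():
    # Each lambda looks up i when called, after the loop has finished.
    return [lambda response: parse2(response, i) for i in range(3)]


def build_default_arg():
    # A default argument captures the current value of i at creation time.
    return [lambda response, i=i: parse2(response, i) for i in range(3)]


def build_partial():
    # functools.partial also freezes the value of i at creation time.
    return [partial(parse2, i=i) for i in range(3)]


late = [callback("resp") for callback in build_late_bound()]
bound = [callback("resp") for callback in build_default_arg()]
frozen = [callback("resp") for callback in build_partial()]

print(late)    # every entry reports the last value of i
print(bound)   # entries report 0, 1, 2
print(frozen)  # same as bound
```

If the callbacks in the truncated code are built this way, passing i=i as a default argument in the lambda is the smallest change that gives each request its own value.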

Web server returns "500 Internal Server Error" after sending this FormRequest using Scrapy

I constructed the following FormRequest based on the request captured by HttpFox (a Firefox add-on). However, the web server always returns "500 Internal Server Error". Could someone help me with this? The original URL is: http://www.intel.com/jobs/jobsearch/index_ne.htm?Location=200000008 Here is my spider's skeleton: class IntelSpider(BaseSpider): ...