
Problems installing libxml2 on Mac OS X

I'm trying to install libxml2 on my Mac (OS 10.6.4). I'm actually just trying to run a Scrapy script in Python, which has required me to install Twisted, Zope, and now libxml2. I've downloaded the latest version (2.7.7) and tried following these instructions. To summarize, I tried this command (in the python sub...

Scrapy web scraper cannot crawl link

Hi, I'm very new to Scrapy. Here is my spider to crawl twistedweb. class TwistedWebSpider(BaseSpider): name = "twistedweb3" allowed_domains = [""] start_urls = [ "", ] rules = ( Rule(SgmlLinkExtractor(), 'parse', follow=True, ), ) def parse...

How to use Scrapy

Hello, I would like to know how I can start a crawler based on Scrapy. I installed the tool via apt-get install and tried to run an example: /usr/share/doc/scrapy/examples/googledir/googledir$ scrapy list /usr/share/doc/scrapy/examples/googledir/googledir$ scrapy crawl I hacked the code from spiders/google_di...

How can I package a scrapy project using cxfreeze?

I have a Scrapy project that I would like to package up for a customer using Windows, without having to manually install dependencies for them. I came across cxfreeze, but I'm not quite sure how it would work with a Scrapy project. I'm thinking I would make some sort of interface and run the Scrapy crawler with 'from scrapy.cmd...

How to install libxml2 in virtualenv?

I have a virtualenv created with the --no-site-packages option, and I'm using Scrapy in it. Scrapy uses libxml2 via import libxml2. How do I install libxml2 in the virtualenv using pip or easy_install? ...

Scrapy - how to identify already scraped URLs

Hi, I'm using Scrapy to crawl a news website on a daily basis. How do I stop Scrapy from scraping URLs it has already scraped? Also, is there any clear documentation or examples for SgmlLinkExtractor? -Avinash ...

How to use python for a webservice

Hi, I am really new to Python and have just played around with the Scrapy framework, which is used to crawl websites and extract data. My question is: how do I pass parameters to a Python script that is hosted somewhere online? E.g. I make the following request. Now I want to pass some parameters, similar to PHP, like *.php?i...

Passing arguments inside Scrapy spider through lambda callbacks

Hi, I have this short spider code: class TestSpider(CrawlSpider): name = "test" allowed_domains = ["", ""] start_urls = [ "" ] def parse2(self, response, i): print "page2, i: ", i # traceback.print_stack() def parse(self, response): for i ...

web server returns "500 Internal Server Error" after sending this FormRequest using Scrapy

I constructed the following FormRequest based on the request content shown by HttpFox (a Firefox add-on). However, the web server always returns "500 Internal Server Error". Could someone help me with this? The original URL is: Here is my spider's skeleton: class IntelSpider(BaseSpider): ...