scrapy

Scraping landing pages of a list of domains

I have a reasonably long list of websites that I want to download the landing (index.html or equivalent) pages for. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like and I am wondering if wget or another alternative would be faster given ...

Scraping a page from a secure URL which is possibly using a session ID

How can I scrape a page like this? https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0 It is secure, and requires a referrer? I can't get anything using wget or httplib2. If you go through this page, you get a list, and it works in a browser but not from the command line. https://www.procom.ca/jobsearch.aspx ...

Scrapy issue with iTunes' AppStore

I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8 In the following code I have used the simplest regex which targets all apps in the US store. from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from ...

scrapy - python question

Hi. This may not be the correct place to post, but I'm going to try anyway! I've got a couple of test Python parsing scripts that I created. They work well enough for me to test what I'm working on. However, I recently came across the Python framework Scrapy, which is used for web scraping. My app runs in a distributed process, across a test...

Creating a spider using Scrapy, Spider generation error.

I just downloaded Scrapy (web crawler) on Windows 32 and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in DOS. I then proceeded to create the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returned the error: Error running: scrap...

Scrapy Could not find spider Error

I have been trying to get a simple spider to run with Scrapy, but keep getting the error: Could not find spider for domain:stackexchange.com when I run the code with the expression scrapy-ctl.py crawl stackexchange.com. The spider is as follows: from scrapy.spider import BaseSpider from __future__ import absolute_import class StackEx...

Extra characters Extracted with XPath and Python (html)

I have been using XPath with Scrapy to extract text from HTML tags online, but when I do I get extra characters attached. An example is trying to extract a number, like "204", from a <td> tag and getting [u'204']. In some cases it's much worse. For instance, trying to extract "1 - Mathoverflow" and instead getting [u'\r\n\t\t ...
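The symptom described above fits a selector that returns a list of strings rather than a single string, with the page's original whitespace left in place. A minimal sketch of cleaning such a list follows; it uses only the standard library, and the helper name `clean_extracted` and the sample inputs are hypothetical, not taken from the question.

```python
import re


def clean_extracted(values):
    """Return the first non-empty string in a list of extracted values.

    Whitespace runs (including \r, \n and \t) are collapsed to a single
    space and leading/trailing whitespace is removed. Returns None when
    the list holds no usable text.
    """
    for value in values:
        text = re.sub(r"\s+", " ", value).strip()
        if text:
            return text
    return None


# Hypothetical inputs shaped like the lists an XPath extraction returns.
print(clean_extracted(["204"]))
print(clean_extracted(["\r\n\t\t 1 - Mathoverflow \r\n"]))
```

Taking one element out of the list is what removes the brackets and the u prefix from the printed result; the regex only handles the stray whitespace.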

Scrapy - Follow RSS links

Hello, I was wondering if anyone ever tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work... I am using the following rule: rules = ( Rule(SgmlLinkExtractor(tags=('link',), attrs=False), follow=True, callback='parse_article'), ) (having in mind ...

Writing a program to scrape forums

Hi, I need to write a program to scrape forums. Should I write the program in Python using the Scrapy framework or should I use PHP cURL? Also, is there a PHP equivalent to Scrapy? Thanks ...

Simulate Browser Resources Expansion Behavior With Python

I'm looking for a way to simulate browser resources expansion behavior. The flow I'm trying to address is the following:
1. Access an initial URL (e.g. http://example.dmn/index.htm)
2. Parse the HTML response received (e.g. index.htm)
3. Find the resources that a browser will fetch as a result of the index parsing, e.g.: Images Flash ...
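The parse-and-find-resources part of that flow can be sketched with the standard library alone. This is only an illustration under stated assumptions: the class name `ResourceCollector`, the function `find_resources`, and the tag-to-attribute table are hypothetical choices, and a real browser fetches more than this (for example, resources referenced from CSS or added by scripts).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of resources referenced by common HTML tags."""

    # Assumed mapping of tag name to the attribute holding the resource URL.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "embed": "src", "iframe": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        attr_name = self.RESOURCE_ATTRS.get(tag)
        if attr_name and attrs.get(attr_name):
            # Resolve relative references against the page URL.
            self.resources.append(urljoin(self.base_url, attrs[attr_name]))
        elif tag == "link" and (attrs.get("rel") or "").lower() == "stylesheet":
            if attrs.get("href"):
                self.resources.append(urljoin(self.base_url, attrs["href"]))


def find_resources(html, base_url):
    """Return the resource URLs found in html, in document order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources


page = (
    '<html><head><link rel="stylesheet" href="/a.css">'
    '<script src="js/app.js"></script></head>'
    '<body><img src="logo.png"></body></html>'
)
for url in find_resources(page, "http://example.dmn/index.htm"):
    print(url)
```

Fetching the initial URL and then downloading each collected resource is left out here so the sketch stays runnable without network access.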

How to integrate spiders and scrapy-ctl.py

Hi all! I am new to Python and Scrapy and hence have some basic doubts (please spare my ignorance about some fundamentals, which I'm willing to learn :D). Right now I am writing some spiders and running them using scrapy-ctl.py from the command line by typing: C:\Python26\dmoz>python scrapy-ctl.py crawl spider But I do not w...

How to pass a string to a spider from another script

Hi all! I am new to Python and Scrapy. I am running scrapy-ctl.py from another Python script using the subprocess module. But I want to pass the 'start url' to the spider from this script itself. Is it possible to pass start_urls (which are determined in the script from which scrapy-ctl is run) to the spider? I will be grateful f...

What is the best way to control where Scrapy crawls when collecting large amounts of specific data from many different sites?

I have been working on a spider that gathers data for research using Scrapy. It crawls around 100 sites that each have a large number of links within them. I need to specify where the spider crawls so that I can tell the spider to collect data from certain parts of the site, while not crawling others to save time. I have been having muc...

How can we run a Python script (which uses NLTK and Scrapy) from Java?

Hi all! I have written Python scripts that use Scrapy, NLTK and simplejson in my project, but I need to run them from Java as my mentor wants to deploy them on a server and I have very little time to do this. I took a glance at runtime.exec() in Java and Jython; needless to say, running system commands from Java doesn't look simple eithe...

Does anyone have example code for a sqlite pipeline in Scrapy?

I am looking for some example code of a SQLite pipeline in Scrapy. I know there is no built-in support for it, but I'm sure it has been done. Only actual code can help me, as I only know enough Python and Scrapy to complete my very limited task, and need the code as a starting point. ...
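As a starting point, here is a minimal sketch of what such a pipeline might look like. It assumes the `open_spider` / `process_item` / `close_spider` hooks used by recent Scrapy releases (older releases, including the scrapy-ctl.py era, may use different signatures), and the class name `SQLitePipeline`, the `items.db` path, and the `title` / `url` fields are all hypothetical. It relies only on the standard `sqlite3` module and on items behaving like dicts.

```python
import sqlite3


class SQLitePipeline:
    """Store each scraped item as a row in a SQLite table (illustrative only)."""

    def __init__(self, db_path="items.db"):
        # Hypothetical default path; ":memory:" also works for experiments.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the DB, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)"
        )

    def process_item(self, item, spider):
        # Assumed item fields: 'title' and 'url'. Missing fields become NULL.
        self.conn.execute(
            "INSERT INTO items (title, url) VALUES (?, ?)",
            (item.get("title"), item.get("url")),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline stage

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()
```

In a real project the class would then be enabled through the project's `ITEM_PIPELINES` setting; committing after every insert keeps the sketch simple at the cost of speed.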

Python function based on Scrapy to crawl an entire web site

Hi, I recently discovered Scrapy, which I find very efficient. However, I really don't see how to embed it in a larger project written in Python. I would like to create a spider in the normal way but be able to launch it on a given URL with a function start_crawl(url) which would launch the crawling process on a given domain and stop o...

Need help with the regular expressions in Python

Hello. Please help me turn a string like: <a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a> into a string like: link_text_part1 another_text link_text_part2 using regular expressions in Python. Note: testsite.com changes ...
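One way to get from the first string to the second is to replace every tag with a space and then collapse the whitespace. The sketch below assumes simple, well-formed markup like the sample; the function name `strip_tags` is hypothetical, and a regex like this is not a general HTML parser (it would mishandle a `>` inside an attribute value, for instance).

```python
import re


def strip_tags(html):
    """Remove anything that looks like an HTML tag and tidy the whitespace."""
    # Replace each <...> tag with a space so adjacent text is not glued together.
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse the resulting whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


sample = (
    '<a href="http://testsite.com" class="className">'
    "link_text_part1 <em>another_text</em> link_text_part2</a>"
)
print(strip_tags(sample))  # link_text_part1 another_text link_text_part2
```

Because the pattern never looks at the href value, it keeps working when testsite.com changes.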

How to remove u'' from Python script result?

Hello. I'm trying to write a parsing script using Python/Scrapy. How can I remove [] and u' from strings in the result file? Now I have text like this: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from scrapy.utils.markup import remove_tags from googleparser.items import GoogleparserItem import sys clas...

How to parse RSS link (get URL to RSS) from the page in Python framework Scrapy?

Hello. I want to parse Google search and get links to RSS from each item from the search results. I use Scrapy. I tried this construction, ... def parse_second(self, response): hxs = HtmlXPathSelector(response) qqq = hxs.select('/html/head/link[@type=application/rss+xml]/@href').extract() print qqq item = response.req...

Best web graph crawler for speed?

For the past month I've been using Scrapy for a web crawling project I've begun. This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pa...