I have a reasonably long list of websites that I want to download the
landing (index.html or equivalent) pages for. I am currently using Scrapy (much love to the guys behind it -- this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering if wget or another alternative would be faster given ...
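If only the landing pages are needed, a plain threaded downloader can beat a full crawling framework, since there is no scheduling or link extraction involved. A minimal Python 3 sketch using only the standard library (the URL list and worker count are assumptions):

import concurrent.futures
import urllib.request

urls = ['http://example.com/', 'http://example.org/']  # assumed site list

def fetch(url):
    # A short timeout keeps one slow host from stalling the whole run.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return url, resp.read()

# The work is I/O-bound, so threads are enough; no need for processes.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for url, body in pool.map(fetch, urls):
        print(url, len(body))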
How do I scrape a page like this?
https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0
It is secure and seems to require a referrer? I can't get anything using wget or httplib2.
If you go through this page, you get a list; it works in a browser but not from the command line.
https://www.procom.ca/jobsearch.aspx
...
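Both wget (--referer=URL) and httplib2 can send a Referer header explicitly, and an ASP.NET site like this may additionally want the cookies set by the search page. A sketch with httplib2, assuming the referrer is the search page:

import httplib2

h = httplib2.Http()
# Pretend we arrived from the search page; the server may check this.
headers = {'Referer': 'https://www.procom.ca/jobsearch.aspx'}
resp, content = h.request(
    'https://www.procom.ca/JobList.aspx?keywords=&Cities=&reference=&JobType=0',
    'GET', headers=headers)
print(resp.status)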
I am using Scrapy to fetch some data from iTunes' AppStore database. I start with this list of apps: http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8
In the following code I have used the simplest regex, which targets all apps in the US store.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ...
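A hedged reconstruction of the kind of rule this describes, using the old contrib API the imports point to; the allow pattern and callback name are assumptions:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class AppStoreSpider(CrawlSpider):
    name = 'appstore'  # assumed spider name
    start_urls = ['http://itunes.apple.com/us/genre/mobile-software-applications/id36?mt=8']
    rules = (
        # Follow every app page in the US store; this is the 'simplest regex'.
        Rule(SgmlLinkExtractor(allow=(r'/us/app/',)), callback='parse_app'),
    )

    def parse_app(self, response):
        pass  # per-app extraction goes here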
Hi..
Maybe not the correct place to post. But, I'm going to try anyway!
I've got a couple of test Python parsing scripts that I created. They work well enough for me to test what I'm working on.
However, I recently came across the python framework, Scrapy, which is used for web scraping. My app runs in a distributed process, across a test...
I just downloaded Scrapy (web crawler) on Windows 32 and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in DOS. I then proceeded to create the first spider using the command:
scrapy-ctl.py genspider myspider myspdier-domain.com
but it did not work and returns the error:
Error running: scrap...
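A frequent cause on Windows is the .py file association dropping command-line arguments, so scrapy-ctl.py never receives the genspider part; invoking the interpreter explicitly usually avoids it (and note the transposed letters in myspdier-domain.com). A sketch, with the project path assumed:

C:\Python26\dmoz> python scrapy-ctl.py genspider myspider myspider-domain.com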
I have been trying to get a simple spider to run with scrapy, but keep getting the error:
Could not find spider for domain:stackexchange.com
when I run the code with the command scrapy-ctl.py crawl stackexchange.com. The spider is as follows:
from __future__ import absolute_import
from scrapy.spider import BaseSpider
class StackEx...
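In the scrapy-ctl.py era, Scrapy located spiders through a domain_name attribute and a module-level SPIDER instance; if either is missing, crawl reports exactly this error. A minimal sketch of the shape those versions expected (the start URL is an assumption):

from scrapy.spider import BaseSpider

class StackExchangeSpider(BaseSpider):
    # Must match the argument given to "scrapy-ctl.py crawl ...".
    domain_name = 'stackexchange.com'
    start_urls = ['http://stackexchange.com/']

    def parse(self, response):
        pass  # parsing logic goes here

# Old Scrapy discovered spiders through this module-level instance.
SPIDER = StackExchangeSpider()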
I have been using XPath with Scrapy to extract text from HTML tags online, but when I do I get extra characters attached. An example is trying to extract a number, like "204", from a <td> tag and getting [u'204']. In some cases it's much worse. For instance trying to extract "1 - Mathoverflow" and instead getting [u'\r\n\t\t ...
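.extract() always returns a list of unicode strings, so [u'204'] is the repr of a one-element list rather than extra characters in the page; the whitespace cases just need a strip. A sketch, with the XPath assumed:

from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    matches = hxs.select('//td/text()').extract()
    # Take the first match and trim the surrounding whitespace.
    value = matches[0].strip() if matches else None
    return value  # u'204' rather than [u'204']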
Hello,
I was wondering if anyone ever tried to extract/follow RSS item links using
SgmlLinkExtractor/CrawlSpider. I can't get it to work...
I am using the following rule:
rules = (
    Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
         follow=True,
         callback='parse_article'),
)
(having in mind ...
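One likely snag: in an RSS feed the item URL is the text of the <link> element, not an attribute, and SgmlLinkExtractor only inspects attributes, so tags=('link',) never sees those URLs. A hedged alternative is to read the feed with XmlXPathSelector and build the requests by hand:

from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

def parse_feed(self, response):
    xxs = XmlXPathSelector(response)
    # RSS item links live in element text, which link extractors ignore.
    for url in xxs.select('//item/link/text()').extract():
        yield Request(url, callback=self.parse_article)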
Hi,
I need to write a program to scrape forums.
Should I write the program in Python using the Scrapy framework, or should I use PHP cURL?
Also, is there a PHP equivalent to Scrapy?
Thanks
...
I'm looking for a way to simulate how a browser expands a page's resources.
The flow I'm trying to address is the following:
Access an initial URL (e.g. http://example.dmn/index.htm)
Parse the html response received (e.g. index.htm)
Find the resources that a browser would fetch as a result of parsing the index, e.g.:
Images
Flash
...
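A rough sketch of that resource-discovery step with lxml; the tag/attribute list is deliberately partial (CSS url(...) references, iframes, etc. would need more work) and the URL is the assumed one from above:

import lxml.html

url = 'http://example.dmn/index.htm'  # the assumed initial URL
doc = lxml.html.parse(url).getroot()  # lxml fetches and parses the page
doc.make_links_absolute(url)          # resolve relative references

resources = set()
# Tags and attributes a browser would fetch while rendering the page.
for xpath in ('//img/@src', '//script/@src',
              '//link[@rel="stylesheet"]/@href',
              '//embed/@src', '//object/@data'):
    resources.update(doc.xpath(xpath))

print(resources)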
Hi all!
I am new to Python and Scrapy and hence have some basic doubts (please excuse my ignorance of some fundamentals, which I'm willing to learn :D).
Right now I am writing some spiders and running them with scrapy-ctl.py from the command line by typing:
C:\Python26\dmoz>python scrapy-ctl.py crawl spider
But I do not w...
Hi all!!!
I am new to Python and Scrapy.
I am running scrapy-ctl.py from another Python script using the subprocess module, but I want to pass the start URL to the spider from this script itself. Is it possible to pass start_urls (which are determined in the script from which scrapy-ctl is run) to the spider?
I will be grateful f...
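In newer Scrapy versions this is what spider arguments are for: anything passed with -a arrives as a keyword argument to the spider, which avoids most of the subprocess plumbing. A sketch, with the spider name assumed:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, start_url=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Build start_urls from whatever the calling script hands in.
        self.start_urls = [start_url] if start_url else []

The calling script then only needs:

import subprocess
subprocess.check_call(['scrapy', 'crawl', 'myspider',
                       '-a', 'start_url=http://example.com/'])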
I have been working on a spider that gathers data for research using Scrapy. It crawls around 100 sites that each have a large number of links within them. I need to specify where the spider crawls so that I can tell it to collect data from certain parts of a site while not crawling others, to save time. I have been having muc...
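CrawlSpider rules are the usual fence here: allow and deny take regular expressions matched against each URL, and links that fail the filter are never requested at all, which is where the time is saved. A sketch with assumed patterns:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ResearchSpider(CrawlSpider):
    name = 'research'                    # assumed
    start_urls = ['http://example.com/'] # assumed; one of the ~100 sites
    rules = (
        # Crawl only the sections of interest and skip the known dead weight.
        Rule(SgmlLinkExtractor(allow=(r'/articles/',), deny=(r'/forum/',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        pass  # data collection goes here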
Hi all!
I have written Python scripts that use Scrapy, NLTK and simplejson in my project, but I need to run them from Java as my mentor wants to deploy them on a server, and I have very little time to do this. I took a glance at Runtime.exec() in Java and at Jython; needless to say, running system commands from Java doesn't look simple eithe...
I am looking for some example code of a SQLite pipeline in Scrapy. I know there is no built in support for it, but I'm sure it has been done. Only actual code can help me, as I only know enough Python and Scrapy to complete my very limited task, and need the code as a starting point.
...
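Not battle-tested code, but the usual shape is small; in newer Scrapy versions a pipeline can hook open_spider/close_spider and keep one connection for the run (table layout, filename, and item fields here are all assumptions):

import sqlite3

class SQLitePipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect('items.db')  # assumed filename
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)')

    def process_item(self, item, spider):
        # Field names are assumptions; adapt them to your Item definition.
        self.conn.execute('INSERT INTO items VALUES (?, ?)',
                          (item.get('title'), item.get('url')))
        return item

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

The class still has to be enabled through the ITEM_PIPELINES setting.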
Hi, I recently discovered Scrapy, which I find very efficient. However, I really don't see how to embed it in a larger project written in Python. I would like to create a spider in the normal way, but be able to launch it on a given URL with a function
start_crawl(url)
which would launch the crawling process on a given domain and stop o...
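With newer Scrapy versions something close to that function can be built on CrawlerProcess; the one real constraint is that Twisted's reactor cannot be restarted, so this works once per process. A sketch, with the spider class assumed to accept a start_url argument:

from scrapy.crawler import CrawlerProcess
from myproject.spiders import MySpider  # assumed project spider

def start_crawl(url):
    process = CrawlerProcess({'LOG_LEVEL': 'INFO'})
    process.crawl(MySpider, start_url=url)
    process.start()  # blocks until the crawl finishes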
Hello. Please help me turn a string like:
<a href="http://testsite.com" class="className">link_text_part1 <em>another_text</em> link_text_part2</a>
into a string like:
link_text_part1 another_text link_text_part2
using regular expressions in Python.
(Note: testsite.com changes.)
...
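For a string like this, stripping the tags with one substitution and then collapsing the leftover whitespace is enough; a sketch:

import re

s = ('<a href="http://testsite.com" class="className">'
     'link_text_part1 <em>another_text</em> link_text_part2</a>')

text = re.sub(r'<[^>]+>', '', s)          # drop every tag
text = re.sub(r'\s+', ' ', text).strip()  # tidy the remaining whitespace
print(text)  # link_text_part1 another_text link_text_part2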
Hello. I'm trying to write a parsing script using Python/Scrapy. How can I remove the [] and u' from the strings in the result file?
Now I have text like this:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.utils.markup import remove_tags
from googleparser.items import GoogleparserItem
import sys
clas...
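The [] and u' are not in the data at all; they are the repr of the unicode list that .extract() returns, so indexing the list and stripping the markup before writing solves it. A sketch that leans on the imports above (the XPath is an assumption):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # .extract() returns a list of unicode strings; take one element...
    raw = hxs.select('//h3/a').extract()[0]
    # ...and strip its markup, leaving a plain string with no [] or u''.
    print(remove_tags(raw).strip())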
Hello.
I want to parse a Google search and get the link to the RSS feed from each item in the search results.
I use Scrapy.
I tried this construction,
...
def parse_second(self, response):
    hxs = HtmlXPathSelector(response)
    qqq = hxs.select('/html/head/link[@type=application/rss+xml]/@href').extract()
    print qqq
    item = response.req...
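One likely culprit: the attribute value in that XPath is unquoted, which makes the expression invalid, so nothing is ever selected. The same line with the value quoted, under that assumption:

qqq = hxs.select('/html/head/link[@type="application/rss+xml"]/@href').extract()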
For the past month I've been using Scrapy for a web crawling project I've begun.
This project involves pulling down the full document content of all web pages in a single domain name that are reachable from the home page. Writing this using Scrapy was quite easy, but it simply runs too slowly. In 2-3 days I can only pull down 100,000 pa...
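Scrapy's throughput ceiling is usually the settings, whose defaults are polite rather than fast. A hedged starting point for newer versions (older releases spell some of these differently, e.g. CONCURRENT_REQUESTS_PER_SPIDER); the values are assumptions to tune, not recommendations:

# settings.py
CONCURRENT_REQUESTS = 64            # overall request parallelism
CONCURRENT_REQUESTS_PER_DOMAIN = 16 # fine to raise for a single-domain crawl
DOWNLOAD_DELAY = 0                  # no per-request politeness delay
DOWNLOAD_TIMEOUT = 30               # give up on slow responses sooner
RETRY_ENABLED = False               # don't burn time retrying dead links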