I have a data aggregator that relies on scraping several sites and indexing their information so that it is searchable by the user.
I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs...
Here is my code. I cannot get any HTTP proxy to work, although SOCKS proxies (socks4/5) work fine. Any ideas why? urllib2 works fine with proxies, so I am confused. Thanks.
Code:
import socks
import httplib2
import BeautifulSoup

httplib2.debuglevel=4

http = httplib2.Http(proxy_info = httplib2.ProxyInfo(...
Hi, I am working on screen scraping and have done it successfully on 3 websites, but I have an issue with the last website.
Here is my URL. When I submit it with my parameter, the result is shown on the next page: it simply posts to another page, and the result displays fine there.
Here is my test
However, when I hit it from my application, since here I don't hav...
I have an html document located on http://somedomain.com/somedir/example.html
The document contains four links:
http://otherdomain.com/other.html
http://somedomain.com/other.html
/only.html
test.html
How can I get the full URLs for the links in the current domain?
I mean I should get:
http://somedomain.com/other.html
http://...
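For the link question above, one possible approach is sketched here in Python, assuming the hrefs have already been extracted from the page. The helper name same_domain_links is hypothetical; urljoin and urlparse come from the standard library. It resolves each href against the page URL and keeps only those on the page's own host.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url, hrefs):
    """Resolve each href against page_url and keep those on the page's own host.

    same_domain_links is a hypothetical helper name used for illustration.
    """
    page_host = urlparse(page_url).netloc
    # urljoin handles absolute, root-relative and document-relative hrefs.
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [url for url in resolved if urlparse(url).netloc == page_host]


links = same_domain_links(
    "http://somedomain.com/somedir/example.html",
    [
        "http://otherdomain.com/other.html",
        "http://somedomain.com/other.html",
        "/only.html",
        "test.html",
    ],
)
for link in links:
    print(link)
# prints:
# http://somedomain.com/other.html
# http://somedomain.com/only.html
# http://somedomain.com/somedir/test.html
```

Comparing on netloc treats subdomains (for example www.somedomain.com) as different hosts; whether that is desired depends on the site.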
I was wondering how Blippy is able to get my data? It requires me to put in my bank name, bank card number and password, so is it doing a simple website scrape by logging in?
My bank, however, also requires a separate passphrase. How does it get around that?
Can urllib and such libraries be used in Python to replicate Blippy fun...
Hey all,
Can anyone direct me to a good Python screen scraping library for JavaScript code (hopefully one with good documentation/tutorials)? I'd like to see what options are out there, but most of all I want the one that is easiest to learn and gives the fastest results, and I am wondering if anyone has experience with this. I've heard some stuff about spidermonkey, but maybe the...
Hello folks,
I am screen scraping a website that is in Danish. I am unable to scrape certain characters, such as må.
Any idea how to solve this?
Thanks
...
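Problems like the one above often come from decoding the page bytes with the wrong character set. The question does not say which language or library is in use, so the following is only a Python sketch under that assumption. The helper name decode_page and the fallback order (declared charset, then UTF-8, then ISO-8859-1) are illustrative choices, not a confirmed fix for that site.

```python
def decode_page(raw_bytes, declared_charset=None):
    """Decode raw response bytes into text.

    decode_page is a hypothetical helper. It tries the charset declared by
    the server or page first (if any), then UTF-8, and finally falls back to
    ISO-8859-1, which accepts every byte value and therefore never fails.
    """
    for charset in (declared_charset, "utf-8"):
        if charset is None:
            continue
        try:
            return raw_bytes.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    return raw_bytes.decode("iso-8859-1")


latin1_bytes = "må".encode("iso-8859-1")  # b'm\xe5'
utf8_bytes = "må".encode("utf-8")         # b'm\xc3\xa5'

print(decode_page(latin1_bytes, "iso-8859-1"))
print(decode_page(utf8_bytes))
print(decode_page(latin1_bytes))  # no declared charset: UTF-8 fails, fallback is used
```

The declared charset would normally come from the Content-Type response header or a meta tag in the page; how to read it depends on the HTTP library in use.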
I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and getting it to work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed.
The documentation for scrubyt seems to be all over the place now...
I'm building an iPhone app which needs a piece of metadata from a user's Google Spreadsheet. Unfortunately the metadata I need is not exposed by the API, so I will need to scrape it from the document's HTML source (it would not be present in any of the exported variants).
Is there any way to include authentication parameters in a call ...
This was the closest question to my question and it wasn't really answered very well imo:
http://stackoverflow.com/questions/2022030/web-scraping-etiquette
I'm looking for the answer to #1:
How many requests/second should you be doing to scrape?
Right now I pull from a queue of links. Every site that gets scraped has its own thread ...
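Whatever rate turns out to be appropriate, a per-site minimum delay between requests is one common way to enforce it. The Python sketch below assumes that design; the class name HostThrottle and its parameters are hypothetical, and the interval is left to the caller because the question of the right value is exactly what is being asked. It is not thread-safe, which matches the one-thread-per-site setup described above only if each thread owns its own instance.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum interval between requests to the same host.

    HostThrottle is a hypothetical helper. The clock and sleep functions are
    injectable so the behaviour can be checked without real waiting.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting url; return the delay applied."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            elapsed = self._clock() - last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay
```

A scraper loop would call wait(url) immediately before each fetch. Checking the site's robots.txt for a Crawl-delay directive is one way to choose the interval, where the site provides one.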
The other day I found myself addicted to a flash game and frustrated by the thing at the same time. In a moment of frustration with the game I thought I would make a 'bot' to beat it for me. Well, I really wouldn't, but it made me realize: I don't know how to interact with another application in a way to do this. Which brings me to th...
Can Mechanize make JavaScript calls?
This would be handy for negotiating AJAX when screen-scraping...
...
I'm trying to read in the html of a certain website.
Trying @something = open("http://www.google.com/") fails with the following error:
Errno::ENOENT in testController#show
No such file or directory - http://www.google.com/
Going to http://www.google.com/, I obviously see the site. What am I doing wrong?
Thanks!
...
What am I trying to do?
Visit a site, retrieve a cookie, then visit the next page by sending the cookie info. It all works, but httplib2 is giving me too many problems with the SOCKS proxy on one site.
http = httplib2.Http()
main_url = 'http://mywebsite.com/get.aspx?id='+ id +'&rows=25'
response, content = http.request(main_url, 'GET', hea...
I'm trying to extract data from: http://www.phillysheriff.com/old_site/properties.html
Ideally, I'd like to get a CSV file with the address, ward, price, and square feet. Is there an easy way to do this?
...
I'm test-building a scraping site with Django. For some reason the following code only returns one image, whereas I'd like it to print every image, every link, and every price. Any help? (Also, if you guys know how to place this data into a database model so I don't have to always scrape the site, I'm all ears, but that may be a...
I am trying to write a program to play a full screen PC game for fun (as an experiment in Computer Vision and Artificial Intelligence).
For this experiment I am assuming the game has no underlying API for AI players (nor is the source available) so I intend to process the visual information rendered by the game on the screen.
The game ...
I'm trying to use Mechanize to log in to Google Docs so that I can scrape something (not possible from the API), but I keep getting a 404 when trying to follow the meta redirect:
require 'rubygems'
require 'mechanize'
USERNAME = "..."
PASSWORD = "..."
LOGIN_URL = "https://www.google.com/accounts/Login?hl=en&continue=http:/...
Dear Experts
EDIT: Thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is a great motivation to continue learning Python!
I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians, however due ...
So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this:
<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td>
The data I want is in here <br />...