screen-scraping

What's the fastest way to scrape a lot of pages in php?

I have a data aggregator that relies on scraping several sites and indexing their information in a way that is searchable to the user. I need to be able to scrape a vast number of pages daily, and I have run into problems using simple curl requests, which are fairly slow when executed in rapid sequence for a long time (the scraper runs...

Does httplib2 support http proxy at all? Socks proxy works but not http.

Here is my code. I cannot get any http proxy to work. Socks proxy (socks4/5) works fine though. Any ideas why? urllib2 works fine with proxies though. I am confused. Thanks. Code: import socks import httplib2 import BeautifulSoup httplib2.debuglevel=4 http = httplib2.Http(proxy_info = httplib2.ProxyInfo(...

Page posting issue when screen scraping

Hi, I am working on screen scraping and have done it successfully on three websites. I have an issue with the last website. Here is my URL. When I submit my parameter, it simply posts to another page and shows the result fine on that page. Here is my test. However, when I hit it from my application, since here I don't hav...

Nokogiri find only inbound links

I have an HTML document located at http://somedomain.com/somedir/example.html The document contains four links: http://otherdomain.com/other.html http://somedomain.com/other.html /only.html test.html How can I get the full URLs for the links in the current domain? I mean I should get: http://somedomain.com/other.html http://...
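
One way to think about the question above, independent of Nokogiri: resolve every href against the page URL, then keep only the results whose host matches the page's host. The following is a minimal sketch in Python rather than Ruby, using only the standard library. The function name `same_domain_links` is hypothetical, and the hrefs are assumed to have been extracted from the document already.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(base_url, hrefs):
    """Resolve each href against base_url and keep those on the same host.

    Hypothetical helper for illustration; href extraction is assumed
    to have happened elsewhere (for example with an HTML parser).
    """
    base_host = urlparse(base_url).netloc
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    resolved = (urljoin(base_url, href) for href in hrefs)
    return [url for url in resolved if urlparse(url).netloc == base_host]


# The page URL and the four links from the question.
base = "http://somedomain.com/somedir/example.html"
hrefs = [
    "http://otherdomain.com/other.html",
    "http://somedomain.com/other.html",
    "/only.html",
    "test.html",
]

links = same_domain_links(base, hrefs)
for link in links:
    print(link)
```

The host comparison here is exact, so `www.somedomain.com` would count as a different domain. Whether that is desirable depends on the site being scraped.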

How does Blippy get its data

I was wondering how Blippy is able to get my data? It requires me to put in my bank name, bank card number and password, so is it doing a simple website scrape by logging in? My bank, however, also requires a separate passphrase. How does it get around that? Can urllib and such libraries be used in Python to replicate Blippy fun...

Python Scraper for Javascript?

Hey all, Can anyone direct me to a good Python screen scraping library for javascript code (hopefully one with good documentation/tutorials)? I'd like to see what options are out there, but most of all the easiest to learn with fastest results... wondering if anyone had experience. I've heard some stuff about spidermonkey, but maybe the...

screen scraping

Hello folks, I am screen scraping a website that is in Danish. I am unable to scrape certain characters, such as må. Any idea how to solve this? Thanks ...
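
A common cause of this symptom is decoding the response bytes with the wrong character set. The sketch below assumes (it is not stated in the question) that the page is served as UTF-8 and was decoded as ISO-8859-1. It only illustrates the mismatch; the variable names are hypothetical.

```python
# Bytes as a UTF-8 server would send the word "må" (assumption: the site
# uses UTF-8; the question does not say which encoding is involved).
raw_bytes = "må".encode("utf-8")

# Decoding with the wrong charset silently produces mojibake.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the charset the server actually used recovers the text.
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

In practice the charset to use would come from the `Content-Type` response header or the page's `<meta>` tag rather than being hard-coded.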

How to export scrubyt extractor?

I've written a scrubyt extractor based on the 'learning' technique - that is, specifying the current text on the page and getting it to work out the XPath expressions itself. However, I now want to export the extractor so that it can be used even when the page has changed. The documentation for scrubyt seems to be all over the place now...

Scraping Google docs (can't use API)

I'm building an iPhone app which needs a piece of metadata from a user's Google Spreadsheet. Unfortunately the metadata I need is not exposed by the API, so I will need to scrape it from the document's HTML source (it would not be present in any of the exported variants). Is there any way to include authentication parameters in a call ...

What's the requests/second standard for scraping websites?

This was the closest question to my question and it wasn't really answered very well imo: http://stackoverflow.com/questions/2022030/web-scraping-etiquette I'm looking for the answer to #1: How many requests/second should you be doing to scrape? Right now I pull from a queue of links. Every site that gets scraped has its own thread ...

Screen scraping an application window and interacting with the mouse and keyboard

The other day I found myself addicted to a flash game and frustrated by the thing at the same time. In a moment of frustration with the game I thought I would make a 'bot' to beat it for me. Well, I really wouldn't, but it made me realize: I don't know how to interact with another application in a way to do this. Which brings me to th...

Can Mechanize make Javascript calls?

Can Mechanize make Javascript calls? This would be handy to negotiate AJAX when screen-scraping... ...

How to open URLs in rails?

I'm trying to read in the html of a certain website. Trying @something = open("http://www.google.com/") fails with the following error: Errno::ENOENT in testController#show No such file or directory - http://www.google.com/ Going to http://www.google.com/, I obviously see the site. What am I doing wrong? Thanks! ...

Help converting code using httplib2 to use urllib2

What am I trying to do? Visit a site, retrieve cookie, visit the next page by sending in the cookie info. It all works but httplib2 is giving me one too many problems with socks proxy on one site. http = httplib2.Http() main_url = 'http://mywebsite.com/get.aspx?id='+ id +'&rows=25' response, content = http.request(main_url, 'GET', hea...

Data extraction from source with lots of white space

I'm trying to extract data from : http://www.phillysheriff.com/old_site/properties.html Ideally I'd be able to get a CSV file with the address, ward, price, and square feet? Is there an easy way to do this? ...

Displaying scraped results in Django template

I'm building a test scraping site with Django. For some reason the following code only provides one image, where I'd like it to print every image, every link, and every price. Any help? (Also, if you guys know how to place this data into a database model so I don't have to always scrape the site, I'm all ears, but that may be a...

How to quickly acquire and process real time screen output

I am trying to write a program to play a full screen PC game for fun (as an experiment in Computer Vision and Artificial Intelligence). For this experiment I am assuming the game has no underlying API for AI players (nor is the source available) so I intend to process the visual information rendered by the game on the screen. The game ...

Using Mechanize with Google Docs

I'm trying to use Mechanize to log in to Google Docs so that I can scrape something (not possible from the API), but I keep getting a 404 when trying to follow the meta redirect: require 'rubygems' require 'mechanize' USERNAME = "..." PASSWORD = "..." LOGIN_URL = "https://www.google.com/accounts/Login?hl=en&continue=http:/...

Problem with Eastern European characters when scraping data from the European Parliament website

Dear Experts EDIT: thanks a lot for all the answers and points raised. As a novice I am a bit overwhelmed, but it is a great motivation for continuing learning Python!! I am trying to scrape a lot of data from the European Parliament website for a research project. The first step is to create a list of all parliamentarians, however due ...

How can I get all content within <td> tag using the HTML Agility Pack?

So I'm writing an application that will do a little screen scraping. I'm using the HTML Agility Pack to load an entire HTML page into an instance of HtmlDocument called doc. Now I want to parse that doc, looking for this: <table border="0" cellspacing="3"> <tr><td>First rows stuff</td></tr> <tr> <td> The data I want is in here <br />...