screen-scraping

Scrape web page contents

I've just started looking into this. I want to scrape the stats from my Netgear router (http://192.168.0.1/setup.cgi?next_file=stattbl.htm) into a CSV file. I run Windows and Linux, but mainly know C++. Any links/solutions? ...

Has anybody ever tried to screen scrape data from sites built with SharePoint?

Or at least could anybody point me to docs about its crazy proprietary URL parameters and HTML field name obfuscation? I can only suppose this is caused by SharePoint... The main problem is that, given a start page built with SharePoint, I can't recreate a form post with a programmatic client because: field names vary, they are appended w...

Non-trivial screen scraping selections using pQuery

I'm using pQuery (a Perl port of jQuery) to select elements and retrieve text from an HTML document. Consider the following markup: <x> <y>code1</y> <z>stuff</z> <y>code2</y> <z>foobar</z> </x> And the following pQuery code: my $target_value = pQuery($markup)->find($pquery_selector)->text; I'm trying to formulate $pquer...

How to get the number of results found for a keyword in Google

I need to supply a keyword like "blue metal kettle" (with/without quotes) and get only the number of results found for this search. If I search without quotes right now, I get: Results 1 - 10 of about 1,040,000 for blue metal kettle. (0.19 seconds) Here '1,040,000' is the number I want. Is there any API function to do this, or I must...
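
Whatever way the statistics line is obtained, pulling the number out of it is the easy part. Below is a minimal Python sketch that assumes the line has exactly the wording quoted in the question ("... of about 1,040,000 for ..."). The function name `parse_result_count` is hypothetical, and the sketch does not fetch anything from Google.

```python
import re


def parse_result_count(stats_line):
    """Return the result count as an int, or None if the wording differs."""
    # Assumption: the count follows the words "of about" and uses commas
    # as thousands separators, as in the line quoted in the question.
    match = re.search(r"of about ([\d,]+)", stats_line)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


line = "Results 1 - 10 of about 1,040,000 for blue metal kettle. (0.19 seconds)"
print(parse_result_count(line))
```

If the page wording changes, the function returns `None` instead of raising, so the caller can tell that the format is no longer the assumed one.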

Simple screen scraping and analysis in .NET

I'm building a small specialized search engine for price info. The engine will only collect specific segments of data on each site. My plan is to split the process into two steps. Simple screen scraping based on a URL that points to the page where the segment I need exists. Is the easiest way to do this just to use a WebClient object a...

Download web page with images and stylesheets and (optionally) E-mailing it

I need to make snapshots of web pages programmatically using PHP and get them into an HTML e-mail. I tried wget --page-requisites. It downloads everything all right, but it doesn't change the HTML page's source code to point to the downloaded files rather than the on-line originals. Also, that HTML is of course a long way from being dis...

Beautifulsoup get value in table

I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "Owner Name(s)". What I have works, but it is really ugly and I am sure it is not the best approach, so I am looking for a better way. Here is what I have: soup = BeautifulSoup(url_opener.open(url)) x = soup('table', text = re.compile("Owner Name"))...

Extracting html elements in a given region?

Given a region defined by a rectangle and a URL, is there any way to determine which elements lie within the given rectangle on the page at the given URL? EDIT: Screen resolution, font size, etc. can all be set to reasonable defaults. ...

How can I read and parse the contents of a webpage in R

I'd like to read the contents of a URL (e.g., http://www.haaretz.com/) in R. I am wondering how I can do it ...

Screen scraping HTTPS using C#

How to screen scrape HTTPS using C#? ...

.Net Screen scraping and session

I am trying to screen scrape using C#. It works a few times, after which I receive a "Session expired" error. Any help will be appreciated. ...

Unit tests for screen-scraping?

I'm new to unit testing so I'd like to get the opinion of some who are a little more clued-in. I need to write some screen-scraping code shortly. The target system is a web UI where there'll be copious HTML parsing and similar volatile goodness involved. I'll never be notified of any changes by the target system (e.g. they put a redes...

In asp.net how to screen scrape multiple records when paging is implemented for the results?

I know how to screen scrape a page and read the data. But I need help on how to get all results when they are paged. Will HTML Agility Pack help with this, or are there other tools or approaches available? ...

How can I scrape the content present on a WAVE validator page to appear on my page?

I've tried fopen, fread, file_get_contents, curl, and none of those work. I keep getting Forbidden errors. There has got to be a way around it. Anyone? ...

Regex HTML Extraction C#

I have searched and searched about Regex but I can't seem to find something that will allow me to do this. I need to get the 12.32, 2,300, 4.644 M and 12,444.12 from the following strings in C#: <td class="c-ob-j1a" property="c-value">12.32</td> <td class="c-ob-j1a" property="c-value">2,300</td> <td class="c-ob-j1a" property="c-value">...
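
As an alternative to a regex, the cells can be read with an HTML parser. The sketch below is in Python rather than C#, uses only the standard library's `html.parser`, and runs on the two complete sample strings shown in the question. The names `ValueCellParser` and `extract_values` are hypothetical, and matching on `property="c-value"` is an assumption about which attribute identifies the cells.

```python
from html.parser import HTMLParser


class ValueCellParser(HTMLParser):
    """Collect the text of every <td> whose property attribute is "c-value"."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_value_cell = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "td" and dict(attrs).get("property") == "c-value":
            self._in_value_cell = True
            self.values.append("")

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_value_cell:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_value_cell = False


def extract_values(html):
    parser = ValueCellParser()
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


rows = (
    '<td class="c-ob-j1a" property="c-value">12.32</td>'
    '<td class="c-ob-j1a" property="c-value">2,300</td>'
)
print(extract_values(rows))
```

The values are returned as strings, so separators such as the comma in `2,300` are preserved for the caller to interpret.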

Trouble Scraping Web Page With Malformed Content

I have written C# code which utilizes the HtmlAgilityPack library in order to scrape a page located at: World's Largest Urban Areas (Page 2). Unfortunately the page consists of malformed content. I'm at an impasse on how to scrape this page. The current code I have (appearing below) freezes on parsing the HTML: HtmlNodeCollection ...

PHP, Zend Framework: How to fetch a page from another server, then deliver the content?

I think this might also be referred to as "scraping". Basically, what I want to do, is if someone clicks this link: <a href="/links/display/id/47">Click here</a> I want my links controller, display action to: find the actual url of link #47 from the database (i.e. http://www.google.com), fetch/scrape the content, display the content...

How can I save the screen as an image from a .NET forms application in C#?

Hi, I have a .NET 3.5 Windows Forms application. When the user keys in data and clicks 'Save', I want to save the entire form as an image file. How can I do this? Thanks, Chak. ...

Programmatic Python Browser with JavaScript

I want to screen-scrape a website that uses JavaScript. There is mechanize, the programmatic web browser for Python. However, it (understandably) doesn't interpret JavaScript. Is there any programmatic browser for Python which does? If not, is there any JavaScript implementation in Python that I could use to attempt to create one? ...

Web scraping/parsing of college course site

Trying to parse/scrape the course site for Memphis. The site is "https://spectrumssb2.memphis.edu/pls/PROD/bwckgens.p_proc_term_date". It appears to be some sort of JavaScript issue, or dynamic generation of the text. I can see the underlying DOM structure using LiveHTTPHeaders/Firefox, but not when I simply view the underlying source/t...