screen-scraping

extract value from web page

Hi, I have a website's home page that I am reading in using cURL, and I need to grab the number of pages that the site has. The information is in a div: <div class="pager"> <span class="page-numbers current">1</span> <a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a> <a href="/users?page=3" title="go to...
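
A minimal sketch of one way to read the page count in Python, assuming the last page is the largest number shown in a `page-numbers` span. The sample markup is a hypothetical completion of the truncated excerpt above, and `count_pages` is an illustrative name, not part of any library.

```python
import re

# Hypothetical sample modelled on the excerpt; the real pager markup is truncated.
SAMPLE_HTML = '''
<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
</div>
'''

# Matches spans whose class starts with "page-numbers" and whose text is a number.
PAGE_SPAN = re.compile(r'<span class="page-numbers[^"]*">\s*(\d+)\s*</span>')

def count_pages(html):
    """Return the largest page number found in the pager, or 0 if none is found."""
    numbers = [int(n) for n in PAGE_SPAN.findall(html)]
    return max(numbers, default=0)

if __name__ == "__main__":
    print(count_pages(SAMPLE_HTML))
```

A regular expression is enough for a fixed snippet like this one, but it is brittle if the markup changes; an HTML parser would likely hold up better.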

Would this asynchronous download work? WebClient.DownloadDataAsync() problem.

I have this class called SiteAsyncDownload.cs. Here's the code: public class SiteAsyncDownloader { WebClient Client = new WebClient(); string SiteSource = null; /// <summary> /// Download asynchronously the source code of any site in string format. /// </summary> /// <param name="URL">Site URL to download.</para...

Using Rake To Scrape ASP.NET Page

Is it possible to use Rake to scrape an ASP.NET application (very simple, just 2 login forms), basically as a spider bot/web crawler, and if so, how? I only ask since I've heard this mentioned before and wonder what method I would use to go about doing it. Help greatly appreciated. ...

I'm new to Perl and have a few regex questions

I'm teaching myself Perl and I learn best by example. As such, I'm studying a simple Perl script that scrapes a specific blog and have found myself confused about a couple of the regex statements. The script looks for the following chunks of html: <dt><a name="2004-10-25"><strong>October 25th</strong></a></dt> <dd> <p> [Cont...

DownloadData() produces HTML different from the browser

I'm trying to download the source HTML of a website using the WebClient.DownloadData() method. My method is supposed to give me the source: public string GetSite(string URL) { Uri Site = new Uri(URL); byte[] lol = Client.DownloadData(Site); SiteSource = Encoding.ASCII.GetString(lol); return SiteSourc...

Scraping and Parsing a Wikipedia Page

Hey guys. I'm wondering if there are any existing libraries in or accessible from Objective-C that would allow me to scrape pages formatted like this one. Specifically, all of the dates and all of the text next to each date. If not, what would be the best way to go about doing this? Regular expressions? I heard that NSString might alread...

Get the type of an element in Hpricot

I want to go through the children of an element and filter only the ones that are text or span, something like: element.children.select {|child| child.class == String || child.element_type == 'span' } but I can't find a way to test which type a certain element is. How do I test that? I'd like to know that regardless if there's a bet...

How to scrape web pages that are in different formats/layouts?

I need to scrape Form 10-K reports (i.e. annual reports of US companies) from the SEC website for a project. The trouble is, companies do not use the exact same format for filing this data. For example, real estate data for 2 different companies could be displayed as below: 1st company Property name State City Ownership Year Occu...

How to record screen and save as gif animation?

Is there such software? ...

Scraping Library for PHP - phpQuery?

I'm looking for a PHP library that allows me to scrape web pages and takes care of all the cookies and of prefilling the forms with the default values; that's what annoys me the most. I'm tired of having to match every single input element with XPath and I would love it if something better existed. I've come across phpQuery but the manual is...

How do I capture a web application's screen to attach to an e-mail on error?

I am working on a web application and we would like to capture the screen (either the application's current screen or the whole screen) and attach this to an e-mail that is automatically generated for error messages. I've seen a few posts about how to do this in a WinForms app, but nothing really on how to do it in a web app. Is it the s...

How to detect mailto links with Hpricot/Nokogiri

I want to match links like <a href="mailto:[email protected]">foo</a>, but this only works in Nokogiri: doc/'a[href ^="mailto:"]' What's the right way of doing that? How do I do that with Hpricot? ...

How to capture a part of a screen using Ruby on Windows?

Instead of using some third party app, I'd like to write an app in Ruby that when invoked, will capture the full screen and save it in c:\screenshot\snap000001.png The graphic package is readily there, but how can you capture a region from the full screen so as to save it? This program is to be invoked by some hot-key, such as settin...

Windows Media Encoding C# capture window - how?

Environment: Windows Media Encoding 9 SDK, C#. My task is to capture video of a window. I have successfully captured the desktop, as explained in the C# code samples, and am now trying to capture a window or an area. The C++ sample uses a PropertyBag to specify the area. How do I specify the region/window in the C# capture code? I use Windows Media Encoding because it simplifies follo...

Screen Scrape Help!

I need some help with screen scraping a site (http://website.com). Let's say I'm trying to get an image inside But when I pull it down, its path is relative, i.e. "image_large/imageName.jpg" (I'm going to pull down this image daily, as it changes daily). It always begins with "images_large/". How can I go in and prepend the URL website.com...
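
One possible approach, sketched in Python: resolve the relative path against the URL of the page it was scraped from rather than concatenating strings by hand. The base URL and image path below are the question's own placeholders, and `absolute_image_url` is an illustrative helper name.

```python
from urllib.parse import urljoin

# Placeholder base URL taken from the question; substitute the real page URL.
BASE_URL = "http://website.com/"

def absolute_image_url(relative_path, base_url=BASE_URL):
    """Resolve a relative image path against the page it was scraped from."""
    return urljoin(base_url, relative_path)

if __name__ == "__main__":
    print(absolute_image_url("images_large/imageName.jpg"))
```

Using `urljoin` also leaves already-absolute URLs untouched, so the same helper can be applied to every image path on the page.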

BeautifulSoup is omitting body of page

BeautifulSoup newbie... Need help. Here is the code sample: from mechanize import Browser from BeautifulSoup import BeautifulSoup mec = Browser() #url1 = "http://www.wines.com/catalog/index.php?cPath=21" url2 = "http://www.wines.com/catalog/product_info.php?products_id=4866" page = mec.open(url2) html = page.read() soup = BeautifulSou...

Selenium: Not able to understand XPath

I have some HTML like this: <h4 class="box_header clearfix"> <span> <a rel="dialog" href="http://www.google.com/?q=word">Search</a> </span> <small> <span> <a rel="dialog" href="http://www.google.com/?q=word">Search</a> </span> </h4> I am trying to get the href here in Java using Selenium. I have tried the following: ...

How to retrieve a directory of files from a remote server?

If I have a directory on a remote web server that allows directory browsing, how would I go about fetching all the files listed there from my other web server? I know I can use urllib2.urlopen to fetch individual files, but how would I get a list of all the files in that remote directory? ...
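
A rough sketch of one way to do this, assuming the server returns an auto-generated HTML index page (Apache style) in which each file appears as a plain relative link. It is written for Python 3 rather than the `urllib2` module mentioned in the question; the sample index, the `example.com` URL, and the names `LinkCollector` and `list_files` are all illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical index page, shaped like a typical auto-generated listing.
SAMPLE_INDEX = '''
<html><body><h1>Index of /files</h1>
<a href="?C=N;O=D">Name</a>
<a href="/">Parent Directory</a>
<a href="a.txt">a.txt</a>
<a href="sub/">sub/</a>
<a href="b.pdf">b.pdf</a>
</body></html>
'''

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def list_files(index_html, base_url):
    """Return absolute URLs of the files linked from a directory index page."""
    parser = LinkCollector()
    parser.feed(index_html)
    files = []
    for href in parser.hrefs:
        # Skip sort links, parent/absolute links, and subdirectories.
        if href.startswith(("?", "/", "../")) or href.endswith("/"):
            continue
        files.append(urljoin(base_url, href))
    return files

if __name__ == "__main__":
    for url in list_files(SAMPLE_INDEX, "http://example.com/files/"):
        print(url)
```

In practice the index HTML would first be fetched, for example with `urllib.request.urlopen(base_url).read().decode()`, and each returned URL could then be downloaded individually. The filtering rules are a guess at what a typical listing contains and may need adjusting for a particular server.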

What's the best way to write a maintainable web scraping app?

I wrote a Perl script a while ago which logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After a few instances of my bank changi...

HTML Scraping with Hpricot (Using Ruby on Rails)

Hi, I have read a great many tutorials on Hpricot. The problem I am finding is that it is not scraping all the HTML, so to speak. I'll elaborate: the website I am attempting to scrape HTML from is http://yellowpages.com.mt/Malta-Search/Radio-In-Malta-Gozo.aspx. I need to obtain the links that are listed as resu...