Let me preface this by saying I don't care what language the solution is written in, as long as it runs on Windows. My problem is this: there is a site with frequently updated data that I would like to fetch at regular intervals for later reporting. The site requires JavaScript to work properly, so just using wget doesn't do the job. What is a good way to either embed a browser in a program or use a stand-alone browser to routinely scrape the screen for this data? Ideally I'd like to grab certain tables on the page, but I can resort to regular expressions if necessary.

+9  A: 

You could probably use web-app testing tools like Watir, WatiN, or Selenium to automate a real browser and pull the values from the page. I've done this for scraping data before, and it works quite well.
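
For example, a rough sketch with Selenium's Python bindings; the URL and the table's CSS selector are placeholders for the real page, and you'd need a matching browser driver installed:

# Hedged sketch: let a real browser execute the JavaScript, then read the DOM.
# pip install selenium; requires a browser driver (e.g. geckodriver) on PATH.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome(), etc.
try:
    driver.get("http://example.com/report")  # placeholder URL
    # Pull every row of the (hypothetical) report table once JS has rendered it
    for row in driver.find_elements(By.CSS_SELECTOR, "table#report tr"):
        cells = [td.text for td in row.find_elements(By.TAG_NAME, "td")]
        if cells:
            print(cells)
finally:
    driver.quit()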

Brian Sullivan
I've used WatiN for automating a JavaScript/HTML game before, and easily retrieved the values I needed.
Simucal
+3  A: 

If JavaScript is a must, you can try instantiating Internet Explorer via ActiveX (CreateObject("InternetExplorer.Application")) and using its Navigate2() method to open your web page.

Set ie = CreateObject("InternetExplorer.Application")
ie.Visible = True
ie.Navigate2 "http://stackoverflow.com"

After the page has finished loading (check document.ReadyState), you have full access to the DOM and can use whatever methods you like to extract the content you need.
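
If you'd rather not write VBScript, the same COM automation can be driven from Python; a minimal sketch assuming the pywin32 package, with a placeholder URL:

# Hedged sketch: drive Internet Explorer over COM from Python.
# pip install pywin32; Windows-only, needs IE installed.
import time
import win32com.client

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = True
ie.Navigate2("http://stackoverflow.com")
while ie.Busy or ie.ReadyState != 4:  # 4 = READYSTATE_COMPLETE
    time.sleep(0.1)
print(ie.Document.title)  # full DOM access from here on
ie.Quit()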

Tomalak
A: 

Give Badboy a try. It's meant to automate system testing of your websites, but you may find its regular-expression rules handy enough to do what you want.

Simon Johnson
+2  A: 

You could look at Beautiful Soup; being open-source Python, it is easy to script (a short usage sketch follows the quote below). Quoting the site:

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  1. Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
  2. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.
  3. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.
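
As a minimal sketch, using the current beautifulsoup4 package; the URL and table layout are placeholders, and note that Beautiful Soup alone won't execute the page's JavaScript:

# Hedged sketch: parse fetched HTML and print the rows of the first table.
# pip install beautifulsoup4
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://example.com/report").read()  # placeholder URL
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")  # first table; adjust for the real page
for row in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        print(cells)
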
gimel
You should migrate to [HTML5lib](http://code.google.com/p/html5lib/) for parsing Web stuff.
hendry
+1  A: 

I would recommend Yahoo Pipes; that's exactly what it was built for. Then you can get the pipe's output as an RSS feed and do what you want with it.
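
Once you have the feed, consuming it programmatically is straightforward; a minimal sketch using the feedparser package, with a placeholder feed URL:

# Hedged sketch: read the items out of an RSS feed.
# pip install feedparser
import feedparser

feed = feedparser.parse("http://example.com/pipe.rss")  # placeholder URL
for entry in feed.entries:
    print(entry.title, entry.link)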

Whaledawg
A: 

If you have Excel, you should be able to import the data from the web page into it.

From the Data menu select Import External Data and then New Web Query.

Once the data is in Excel then you can either manipulate it within Excel or output it in a format (e.g. CSV) you can use elsewhere.

A: 

To complement Whaledawg's suggestion, I was going to suggest using an RSS scraper application (do a Google search), which would give you nice raw XML to consume programmatically instead of a response stream. There may even be a few open-source implementations that would give you more of an idea if you wanted to implement it yourself.

The Giraffe
+1  A: 

If you are familiar with Java (or perhaps another language that runs on the JVM, such as JRuby or Jython), you can use HtmlUnit. HtmlUnit simulates a complete browser: it makes HTTP requests, creates a DOM for each page, and runs JavaScript (using Mozilla's Rhino).

Additionally, you can run XPath queries on documents loaded in the simulated browser, simulate events, etc.

http://htmlunit.sourceforge.net

alex
A: 

You could use the Perl module LWP together with the JavaScript module. While this may not be the quickest to set up, it should work reliably. I definitely wouldn't make this your first foray into Perl, though.

Brad Gilbert
I checked to see if ActiveState supported the JavaScript Perl module, and it appears that they do.
Brad Gilbert
A: 

I recently did some research on this topic. The best resource I found is this Wikipedia article, which gives links to many screen scraping engines.

I needed something that I could use as a server and run in batch, and from my initial investigation I think Web Harvest is quite good as an open-source solution. I have also been impressed by Screen Scraper, which seems very feature-rich and can be used with different languages.

There is also a newer project called Scrapy; I haven't checked it out yet, but it's a Python scraping framework (a short sketch follows).
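
For completeness, a minimal Scrapy spider sketch; the URL and selectors are placeholders, and note that Scrapy by itself does not execute JavaScript:

# Hedged sketch: a Scrapy spider that yields each table row as an item.
# pip install scrapy; run with: scrapy runspider table_spider.py -o out.json
import scrapy

class TableSpider(scrapy.Spider):
    name = "table_spider"
    start_urls = ["http://example.com/report"]  # placeholder URL

    def parse(self, response):
        for row in response.css("table tr"):
            cells = row.css("td::text").getall()
            if cells:
                yield {"cells": cells}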

Jean Barmash