views: 225
answers: 6

What is the easiest way to programmatically extract structured data from a bunch of web pages?

I am currently using an Adobe AIR program I have written to follow the links on one page and grab a section of data off of the subsequent pages. This actually works fine, and for programmers I think this (or other languages) provides a reasonable approach, to be written on a case-by-case basis. Maybe there is a specific language or library that allows a programmer to do this very quickly, and if so I would be interested in knowing what they are.

Also, do any tools exist which would allow a non-programmer, like a customer support rep or someone in charge of data acquisition, to extract structured data from web pages without the need for a lot of copying and pasting?

A: 

I use a combination of Ruby with hpricot and watir; it gets the job done very efficiently.

Alon
+2  A: 

I found YQL to be very powerful and useful for this sort of thing. You can select any web page from the internet, it will tidy the markup into something valid, and then you can use XPath to query sections of it. You can output the result as XML or JSON, ready for loading into another script/application.

I wrote up my first experiment with it here:

http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/

Since then YQL has become more powerful with the addition of the EXECUTE keyword, which allows you to write your own logic in JavaScript and run it on Yahoo!'s servers before the data is returned to you.

A more detailed writeup of YQL is here.

You could create a data table for YQL that gets at the basics of the information you are trying to grab, and then the person in charge of data acquisition could write very simple queries (in a DSL which is pretty much English) against that table. It would be easier for them than "proper programming", at least...
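
For example, here is a rough sketch of how such a query could be driven from Perl. This is not from the answer itself: the target URL, the XPath and the result handling are placeholders, and the public YQL endpoint shown has since been retired by Yahoo!.

use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON::PP;

# Hypothetical YQL query: pull table rows out of a page via XPath.
my $yql = q{select * from html where url="http://example.com/products" }
        . q{and xpath='//table[@id="products"]//tr'};

my $uri = URI->new('http://query.yahooapis.com/v1/public/yql');
$uri->query_form( q => $yql, format => 'json' );

my $res = LWP::UserAgent->new->get($uri);
die $res->status_line unless $res->is_success;

# The matched nodes come back under query.results as a Perl structure.
my $data = decode_json( $res->decoded_content );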

vitch
+11  A: 

If you do a search on Stack Overflow for WWW::Mechanize & pQuery, you will see many examples using these Perl CPAN modules.
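
As a rough illustration (not taken from those answers), here is a minimal WWW::Mechanize sketch of the "follow the links on one page, grab data off the subsequent pages" workflow from the question; the URL and the link pattern are made up:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://example.com/listing');

# Collect the detail-page links from the index page (placeholder pattern).
my @links = $mech->find_all_links( url_regex => qr{/detail/} );

for my $link (@links) {
    $mech->get( $link->url_abs );
    my $html = $mech->content;
    # Hand $html to pQuery, Web::Scraper, HTML::TreeBuilder etc. to pull
    # out the structured fields you need.
}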

However, because you have mentioned "non-programmer", perhaps the Web::Scraper CPAN module may be more appropriate? It's more DSL-like and so perhaps easier for a non-programmer to pick up.

Here is an example from the documentation for retrieving tweets from Twitter:

use URI;
use Web::Scraper;

# Each "li.status" element becomes one entry in the "tweets" array; within
# each one we pull out the tweet text, the timestamp and the permalink.
my $tweets = scraper {
    process "li.status", "tweets[]" => scraper {
        process ".entry-content",    body => 'TEXT';
        process ".entry-date",       when => 'TEXT';
        process 'a[rel="bookmark"]', link => '@href';
    };
};

# Fetch the page and run the scraper over it.
my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

for my $tweet (@{$res->{tweets}}) {
    print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
}

/I3az/

draegtun
A: 

If you don't mind it taking over your computer, and you happen to need JavaScript support, WatiN is a pretty damn good browsing tool. Written in C#, it has been very reliable for me in the past, providing a nice browser-independent wrapper for running through pages and getting text from them.

Robert P
+2  A: 

There is Sprog, which lets you graphically build processes out of parts (Get URL -> Process HTML Table -> Write File), and you can put Perl code in any stage of the process, or write your own parts for non-programmer use. It looks a bit abandoned, but it still works well.

MkV
A: 

Are commercial tools viable answers? If so, check out http://screen-scraper.com/ - it is super easy to set up and use for scraping websites. They have a free version which is actually fairly complete. And no, I am not affiliated with the company :)

Ryan K