views: 225
answers: 6

What is the easiest way to programmatically extract structured data from a bunch of web pages?

I am currently using an Adobe AIR program I have written to follow the links on one page and grab a section of data off of the subsequent pages. This actually works fine, and for programmers I think this (or other languages) provides a reasonable approach, to be written on a case-by-case basis. Maybe there is a specific language or library that allows a programmer to do this very quickly, and if so I would be interested in knowing what they are.

Also, do any tools exist which would allow a non-programmer, like a customer support rep or someone in charge of data acquisition, to extract structured data from web pages without the need for a lot of copying and pasting?

A: 

I use a combination of Ruby with hpricot and watir; it gets the job done very efficiently.

Alon
+2  A: 

I found YQL to be very powerful and useful for this sort of thing. You can select any web page from the internet, it will tidy the markup into something valid, and then you can use XPath to query sections of it. You can output the result as XML or JSON, ready for loading into another script/application.

I wrote up my first experiment with it here:

http://www.kelvinluck.com/2009/02/data-scraping-with-yql-and-jquery/

Since then YQL has become more powerful with the addition of the EXECUTE keyword, which allows you to write your own logic in JavaScript and run it on Yahoo!'s servers before the data is returned to you.

A more detailed writeup of YQL is here.

You could create a data table for YQL that gets at the basics of the information you are trying to grab, and then the person in charge of data acquisition could write very simple queries (in a DSL which is pretty much English) against that table. It would be easier for them than "proper programming", at least...
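
For example, here is a rough sketch of how such a query could be driven from Perl. This is not from the answer itself: the target URL, the XPath and the result handling are placeholders, and the public YQL endpoint shown has since been retired by Yahoo!.

use strict;
use warnings;
use LWP::UserAgent;
use URI;
use JSON::PP;

# Hypothetical YQL query: pull table rows out of a page via XPath.
my $yql = q{select * from html where url="http://example.com/products" }
        . q{and xpath='//table[@id="products"]//tr'};

my $uri = URI->new('http://query.yahooapis.com/v1/public/yql');
$uri->query_form( q => $yql, format => 'json' );

my $res = LWP::UserAgent->new->get($uri);
die $res->status_line unless $res->is_success;

# The matched nodes come back under query.results as a Perl structure.
my $data = decode_json( $res->decoded_content );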

vitch
+11  A: 

If you do a search on Stack Overflow for WWW::Mechanize & pQuery, you will see many examples using these Perl CPAN modules.
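
As a rough illustration (not taken from those answers), here is a minimal WWW::Mechanize sketch of the "follow the links on one page, grab data off the subsequent pages" workflow from the question; the URL and the link pattern are made up:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://example.com/listing');

# Collect the detail-page links from the index page (placeholder pattern).
my @links = $mech->find_all_links( url_regex => qr{/detail/} );

for my $link (@links) {
    $mech->get( $link->url_abs );
    my $html = $mech->content;
    # Hand $html to pQuery, Web::Scraper, HTML::TreeBuilder etc. to pull
    # out the structured fields you need.
}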

However, because you have mentioned "non-programmer", perhaps the Web::Scraper CPAN module may be more appropriate? It's more DSL-like and so perhaps easier for a non-programmer to pick up.

Here is an example from the documentation for retrieving tweets from Twitter:

use URI;
use Web::Scraper;

# Each "li.status" element becomes one entry in the "tweets" array; within
# each one we pull out the tweet text, the timestamp and the permalink.
my $tweets = scraper {
    process "li.status", "tweets[]" => scraper {
        process ".entry-content",    body => 'TEXT';
        process ".entry-date",       when => 'TEXT';
        process 'a[rel="bookmark"]', link => '@href';
    };
};

# Fetch the page and run the scraper over it.
my $res = $tweets->scrape( URI->new("http://twitter.com/miyagawa") );

for my $tweet (@{$res->{tweets}}) {
    print "$tweet->{body} $tweet->{when} (link: $tweet->{link})\n";
}

/I3az/

draegtun
A: 

If you don't mind it taking over your computer, and you happen to need JavaScript support, WatiN is a pretty damn good browsing tool. Written in C#, it has been very reliable for me in the past, providing a nice browser-independent wrapper for running through pages and getting text from them.

Robert P
+2  A: 

There is Sprog, which lets you graphically build processes out of parts (Get URL -> Process HTML Table -> Write File), and you can put Perl code in any stage of the process, or write your own parts for non-programmer use. It looks a bit abandoned, but it still works well.

MkV
A: 

Are commercial tools viable answers? If so, check out http://screen-scraper.com/ - it is super easy to set up and use for scraping websites. They have a free version which is actually fairly complete. And no, I am not affiliated with the company :)

Ryan K