How can I extract information from a website (http://tv.yahoo.com/listings) and then create an XML file out of it? I want to save it so that I can parse it later and display the information using JavaScript.

I am quite new to Perl and I have no idea about how to do it.

+11  A: 

Of course. The easiest way would be the Web::Scraper module. It lets you define scraper objects that consist of

  1. hash key names,
  2. XPath expressions that locate elements of interest,
  3. and code to extract bits of data from them.

Scraper objects take a URL and return a hash of the extracted data. The extractor code for each key can itself be another scraper object, if necessary, so that you can define how to scrape repeated compound page elements: provide the XPath to find the compound element in an outer scraper, then provide a bunch more XPaths to pull out its individual bits in an inner scraper. The result is then automatically a nested data structure.

In short, you can very elegantly suck data from all over a page into a Perl data structure. In doing so, the full power of XPath + Perl is available for use against any page. Since the page is parsed with HTML::TreeBuilder, it does not matter how nasty a tagsoup it is. The resulting scraper scripts are much easier to maintain and far more tolerant of minor markup variations than regex-based scrapers.

Bad news: as yet, its documentation is almost non-existent, so you have to get by with googling for something like [miyagawa web::scraper] to find example scripts posted by the module’s author.
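As a rough illustration of the nested-scraper idea described above, here is a minimal sketch. The HTML snippet, the class names (show, time, title) and the second programme are invented for the example and are not taken from the Yahoo page; the real markup will need different XPath expressions. It also passes an HTML string to scrape() instead of a URL so that it runs offline.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup, made up for this sketch; the real page will differ.
my $html = <<'HTML';
<ul>
  <li class="show"><span class="time">4:00pm</span> <span class="title">Local Programming</span></li>
  <li class="show"><span class="time">6:30pm</span> <span class="title">Evening News</span></li>
</ul>
HTML

# Inner scraper: pulls the individual bits out of one compound element.
my $show = scraper {
    process '//span[@class="time"]',  time  => 'TEXT';
    process '//span[@class="title"]', title => 'TEXT';
};

# Outer scraper: finds each compound element and hands it to the inner one.
# The "[]" suffix on the key collects all matches into an array reference.
my $listings = scraper {
    process '//li[@class="show"]', 'shows[]' => $show;
};

# scrape() also accepts a URI object, in which case it fetches the page itself.
my $data = $listings->scrape($html);

printf "%s @ %s\n", $_->{title}, $_->{time} for @{ $data->{shows} };
```

The result in $data is a plain nested structure (a hash holding an array of hashes), so it can be handed straight to whatever serializer you prefer.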

Aristotle Pagaltzis
Corion
Do you really want to recommend this kind of beta module?
Beta, really? It’s glue for a combo of LWP, HTML::TreeBuilder and HTML::Selector::XPath, all battle-tested production-quality modules. If you enjoy writing boilerplate, though, suit yourself…
Aristotle Pagaltzis
I haven't tried it so perhaps I jumped to conclusions. But the author notes "THIS MODULE IS IN ITS BETA QUALITY. THE API IS STOLEN FROM SCRAPI BUT MAY CHANGE IN THE FUTURE"
+2  A: 

While in general LWP::Simple or WWW::Mechanize and HTML::Tree are good ways to extract data from web pages, in this particular case (TV listings) there's a much easier way:

Use XMLTV with data from Schedules Direct. There is a small fee (US$20/year), but there are advantages:

  1. The parsing code is already written for you (just use XMLTV;).
  2. You won't be violating Yahoo's terms of service.
  3. You won't have to deal with Yahoo actively trying to break your script. (They don't like automated scripts pulling down TV listings; see #2.)
cjm
+1  A: 

If you want to pass the information to JavaScript, use JavaScript Object Notation (JSON) instead of XML. There are plenty of Perl libraries, such as JSON::Any, that can handle that for you.
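A minimal sketch of the serialization step, assuming the scraped listings end up as an array of hashes. It uses JSON::PP rather than JSON::Any, only because JSON::PP ships with Perl 5.14 and later; the sample record is illustrative.

```perl
use strict;
use warnings;
use JSON::PP;

# Illustrative listing data, shaped like the output of a scraper.
my @tv_progs = (
    { name => 'Local Programming', time => '4:00pm - 6:30pm' },
);

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode( \@tv_progs );

print "$json\n";
# => [{"name":"Local Programming","time":"4:00pm - 6:30pm"}]
```

On the browser side, JSON.parse turns that string back into an array of objects with no further parsing code.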

brian d foy
+1  A: 

tv.yahoo.com is not very semantic and not very easy to scrape! Maybe there are better alternatives or feeds?

Using pQuery I can quickly get times & shows....

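use feature 'say';   # say() needs Perl 5.10 or later
use pQuery;

# Print the text of every element with class "show"
pQuery( 'http://tv.yahoo.com/listings' )
    ->find( '.show' )->each(
        sub {
            my $pQ = pQuery( $_ );   # each() puts the current element in $_
            say $pQ->text;
        }
    );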

  # => 4:00pm - 6:30pm Local Programming

To scrape the details a bit more, you can try this....

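use feature 'say';   # say() needs Perl 5.10 or later
use pQuery;

my @tv_progs;
pQuery( 'http://tv.yahoo.com/listings' )
    ->find( 'li div strong' )->each(
        sub {
            my $n  = shift;          # index of the current match, starting at 0
            my $pQ = pQuery( $_ );   # current element is in $_
            $tv_progs[ $n ]->{ time } = $pQ->text;
        }
    )
    ->end                            # go back to the whole page
    ->find( '.showTitle' )->each(
        sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            # assumes the n-th title belongs to the n-th time slot
            $tv_progs[ $n ]->{ name } = $pQ->text;
        }
    );

for my $prog ( @tv_progs ) {
    say $prog->{name} . " @ " . $prog->{time};
}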

   # => Local Programming @ 4:00pm - 6:30pm

And to get channel....

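use feature 'say';   # say() needs Perl 5.10 or later
use pQuery;

# Print the text of each channel header link
pQuery( 'http://tv.yahoo.com/listings' )
    ->find( '.chhdr a' )->each(
        sub {
            my $pQ = pQuery( $_ );   # current element is in $_
            say $pQ->text;
        }
    );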

  # => ABC

However, matching channels back to the programme info will require a bit of work ;-)

/I3az/

draegtun