Thanks to everyone who has helped me get this far.

Now my new problem. I'm working with a book that was written in 2003 and the tutorial is trying to spider a page that has changed.

The original address is "http://www.oreilly.com/catalog/prdindex.html". That page no longer exists, but it redirects to the new page: "http://oreilly.com/store/complete.html"

The problem, I think, is that the HTML has changed over the past 7 years. The code used to be something like this:

<tr bgcolor="#ffffff">
<td valign="top">
<a href="http://oreilly.com/catalog/googlehks">Google Hacks</a><br />
</td>
<td valign="top" nowrap="nowrap">0-596-00447-8</td>
<td valign="top" align="right">$24.95</td>
<td valign="top" nowrap="nowrap" align="center">
<a href="http://safari.oreilly.com/0596004478">Read it on Safari</a>
</td>
<td valign="top" nowrap="nowrap">
<a href="http://examples.oreilly.com/googlehks">Get examples</a>
</td>
</tr>

Anyway, the HTML has changed. You can see the current markup by viewing the page source in your browser.

When I run the script I get this error:

Use of uninitialized value in subroutine entry at /usr/lib/perl5/site_perl/5.8.8/HTML/TreeBuilder.pm line 93.
Can't call method "as_HTML" on an undefined value at ./SpiderTutorial_19_09.pl line 67.
There are 0 Perl books and 0 Java books. 0 more Java than Perl.

Here is the code I'm trying to run.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::TreeBuilder;

my $url = 'http://oreilly.com/store/complete.html';
my $page = get( $url ) or die $!;
my $p = HTML::TreeBuilder->new_from_content( $page );
my($book);
my($edition);

my @links = $p->look_down(
        _tag => 'a',
        href => qr{^ \Qhttp://oreilly.com/complete/\E \w+ $}x
);

my @rows = map { $_->parent->parent } @links;

my @books;
for my $row (@rows) {
        my %book;
        my @cells = $row->look_down( _tag => 'td' );
        $book{title}    =$cells[0]->as_trimmed_text;
        $book{price}    =$cells[2]->as_trimmed_text;
        $book{price} =~ s/^\$//;

        $book{url}              = get_url( $cells[0] );
        $book{ebook}    = get_url( $cells[3] );
        $book{safari}   = get_url( $cells[4] );
        $book{examples} = get_url( $cells[5] );
        push @books, \%book;
}

sub get_url {
        my $node = shift;
        my @hrefs = $node->look_down( _tag => 'a');
        return unless @hrefs;
        my $url = $hrefs[0]->attr('href');
        $url =~ s/\s+$//;
        return $url;
}

$p = $p->delete; #we don't need this anymore.

{
        my $count = 1;
        my @perlbooks = sort { $a->{price} <=> $b->{price} }
                                        grep { $_->{title} =~/perl/i } @books;
        print $count++, "\t", $_->{price}, "\t", $_->{title} for @perlbooks;
}

{
        my @perlbooks = grep { $_->{title} =~ /perl/i } @books;
        my @javabooks = grep { $_->{title} =~ /java/i } @books;
        my $diff =  @javabooks - @perlbooks;
        print "There are ".@perlbooks." Perl books and ".@javabooks. " Java books. $diff more Java than Perl.";
}

for my $book ( $books[34] ) {
        my $url = $book->{url};
        my $page = get( $url );
        my $tree = HTML::TreeBuilder->new_from_content( $page );
        my ($pubinfo) = $tree->look_down(
                _tag  => 'span',
                class => 'secondary2'
        );
        my $html = $pubinfo->as_HTML; print $html;
        my ($pages) = $html =~ /(\d+) pages/;
        my ($edition) = $html =~ /(\d)(?:st|nd|rd|th) Edition/;
        my ($date) = $html =~ /(\w+ (19|20)\d\d)/;

        print "\n$pages $edition $date\n";

        my ($img_node) = $tree->look_down(
                _tag => 'img',
                src  => qr{^/catalog/covers/},
        );
        my $img_url = 'http://www.oreilly.com'.$img_node->attr('src');
        my $cover = get( $img_url );
        # now save $cover to disk
}
+5  A: 

Errors of the form:

 Can't call method _________ on an undefined value at _________  line ___

mean that you have a construct like this:

 $object->method 

And the value of the thing to the left ($object) is undefined.

That means in your case near line 67, $pubinfo is undefined. You have to visually work back up through the code to find out why. In this case $tree->look_down() must have returned an undefined value.
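To see the mechanism in isolation, here is a minimal sketch that uses only core Perl. The find_nodes sub is a hypothetical stand-in for look_down (it is not part of HTML::TreeBuilder) and simply returns an empty list, as a search with no matches would in list context.

```perl
use strict;
use warnings;

# Hypothetical stand-in for look_down(): nothing matches, so it returns
# an empty list.
sub find_nodes { return () }

# List assignment from an empty list leaves $node undefined.
my ($node) = find_nodes();

# Calling a method on undef dies; eval traps the error so we can inspect it.
my $ok  = eval { $node->as_HTML; 1 };
my $err = $ok ? '' : $@;
print "Trapped: $err" unless $ok;

# The guard: test the value before calling a method on it.
my $html = defined $node ? $node->as_HTML : '(no match)';
print "$html\n";
```

The trapped message has the same "Can't call method ... on an undefined value" form as the one in the question, which is why the search that produced the value is the place to look.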

This probably has everything to do with the structure of the page changing, as has been pointed out. Elements aren't where they used to be. Get the HTML source of the new page, compare it with the code, work out what the code was originally trying to do, and then apply that to the new page. Hopefully the book was good enough that you understand the code even without a working example.

clintp
A: 

When you use HTML::TreeBuilder's look_down and look_up methods, you need to handle the case where no results come back.

If look_down searches the page and finds nothing, $pubinfo is left undefined, and calling a method on it produces exactly the error you are seeing.

Where you have:

my ($pubinfo) = $tree->look_down(
    _tag => 'span',
    class => 'secondary2'
);

my $html = $pubinfo->as_HTML; print $html;

Do this to skip books with no pubinfo:

my ($pubinfo) = $tree->look_down(
    _tag => 'span',
    class => 'secondary2'
);

next unless $pubinfo;  # trap no results.    

my $html = $pubinfo->as_HTML; print $html;

Or try this to display a default message:

my ($pubinfo) = $tree->look_down(
    _tag => 'span',
    class => 'secondary2'
);

my $html = $pubinfo 
         ? $pubinfo->as_HTML
         : '<span>No Publisher Info Available</span>';
print $html;

ANY time you do something that may return uncertain results you need to examine the results and verify that they meet your expectations. In this code, you should be checking the results from get and every look operation.
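As a rough sketch of checking both steps in one place, the helper below wraps the parse and the search. The name pubinfo_html is made up for illustration, and the inline HTML strings stand in for whatever get would return (including undef for a failed fetch), so the example runs without network access.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns the publisher-info HTML, or undef when the
# page is missing or contains no matching <span>.
sub pubinfo_html {
    my ($page) = @_;
    return unless defined $page;    # the fetch returned nothing

    my $tree = HTML::TreeBuilder->new_from_content($page);
    my ($pubinfo) = $tree->look_down(
        _tag  => 'span',
        class => 'secondary2',
    );
    my $html = $pubinfo ? $pubinfo->as_HTML : undef;
    $tree->delete;                  # release the parse tree
    return $html;
}

# Stand-ins for three outcomes: failed fetch, page without the span,
# page with the span.
for my $page ( undef,
               '<p>no span here</p>',
               '<span class="secondary2">320 pages</span>' ) {
    my $html = pubinfo_html($page);
    print defined $html ? "found: $html\n" : "skipped: no publisher info\n";
}
```

Returning undef for both failure modes keeps the caller's check to a single defined test. If you need to tell a failed fetch apart from a missing element, check the result of get separately before calling the helper.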

BTW, why are you using a for loop to iterate over one item? (for my $book ( $books[34] )). I'm not sure what this is buying you apart from an enclosing block scope for the contents of the loop.
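If the goal is just to work on one book, one possible alternative is an if block that also checks that the element exists. This is only a sketch: the @books data below is invented for illustration and deliberately has fewer than 35 entries.

```perl
use strict;
use warnings;

# Hypothetical data: only three books, so index 34 does not exist.
my @books = map { +{ title => "Book $_", url => "http://example.com/$_" } } 1 .. 3;

my $index = 34;
my $status;

# The block still gives $book its own scope, but only runs when the
# element is actually there.
if ( defined( my $book = $books[$index] ) ) {
    $status = "processing $book->{title}";
}
else {
    $status = "no book at index $index (only " . scalar(@books) . " found)";
}
print "$status\n";
```

One trade-off: next is only valid inside a loop, so with this form the "next unless $pubinfo" guard shown above would have to become a nested if instead.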

daotoad