ansaurus

Question

Answer 1

+1 A:

Your call to look_down() can't distinguish between the links you want and the ones you don't. Try a stricter filter, such as:

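@search_results = $page->look_down(
    _tag => 'a',    # anchors only; HTML::Element keeps the tag name under _tag
    sub { my $href = $_[0]->attr('href');    # undef when <a> has no href
          defined $href && $href =~ /\?PA=/ },    # only match http://...?PA=...
);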
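
To try a filter of this kind in isolation, here is a minimal self-contained sketch. It assumes HTML::TreeBuilder is installed. The inline sample markup is hypothetical: the first URL comes from the output shown further down, and the other links are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical sample markup: two wanted links and two that should be skipped.
my $html = <<'HTML';
<html><body>
<a href="http://valeptr.com/scripts/runner.php?PA=33425">wanted</a>
<a href="http://valeptr.com/about.php">no PA parameter</a>
<a name="top">no href at all</a>
<a href="http://valeptr.com/scripts/runner.php?PA=33426">wanted</a>
</body></html>
HTML

my $page = HTML::TreeBuilder->new_from_content($html);

# Keep only <a> elements whose href contains "?PA=".
my @search_results = $page->look_down(
    _tag => 'a',
    sub {
        my $href = $_[0]->attr('href');       # undef for <a name="...">
        defined $href && $href =~ /\?PA=/;
    },
);

# Collect the matching URLs before the tree is torn down.
my @hrefs = map { $_->attr('href') } @search_results;
print "$_\n" for @hrefs;

$page->delete;    # free the tree explicitly
```

The defined check matters under "use warnings", because anchors without an href would otherwise trigger an uninitialized-value warning in the regex match.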

mobrule 2010-07-06 19:39:07

Answer 2

A:

I would be inclined to use HTML::TokeParser::Simple for this just to avoid the overhead of building a document tree:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('t.html');

while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->get_attr('href');
    next unless defined $href && $href =~ /runner\.php\?PA=[0-9]+\z/;

    print $tag->as_is;

    while ( my $token = $parser->get_token ) {
        print $token->as_is;
        last if $token->is_end_tag('/a');
    }
    print "\n";
}

Output:

<a href="http://valeptr.com/scripts/runner.php?PA=33425" target="_ptc" onclick="javascript:reloadpage(11)"> <img src="1appsearch.php_files/runner_007.gif" alt="Xray-cash" border="0"> </a> ... etc

Sinan Ünür 2010-07-06 22:42:10

extracting text from HTML (Perl)
