I have a HUGE HTML file full of things I don't need, but inside it there are URLs in the following format:

<a href="http://www.retailmenot.com/" class=l

I'm trying to extract the URLs... I tried, to no avail:

open(FILE,"<","HTML.htm") or die "$!";
my @str = <FILE>;

my @matches = grep { m/a href="(.+?") class=l/ } @str

Any idea on how to match this?

+10  A: 

Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link extracting Perl modules. You don't need a regex at all.

Here's a short example. You don't have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect:

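#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect the class attribute along with href for <a> tags.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my $p = HTML::LinkExtor->new;
$p->parse( do { local $/; <> } );  # slurp the input (files on the command line, or STDIN)
$p->eof;

# Each link is an array ref: [ $tag, attribute => value, ... ].
# Keep the ones with the class you want ('l' in the question).
my @links = grep { 
    my( $tag, %hash ) = @$_;
    no warnings 'uninitialized';
    $hash{class} eq 'foo';
    } $p->links;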

If you need to collect URLs from other tags, make a similar adjustment to their entries in %HTML::Tagset::linkElements.
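For instance, here's a minimal sketch of the same adjustment applied to img tags as well. The sample markup, the %url_attr lookup table, and the choice of img are my own assumptions for illustration, and the class is 'l' as in the question:

```perl
#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect the class attribute alongside the URL attribute for both tags.
$HTML::Tagset::linkElements{'a'}   = [ qw( href class ) ];
$HTML::Tagset::linkElements{'img'} = [ qw( src class ) ];

# Hypothetical sample markup, standing in for the real file.
my $html = <<'HTML';
<a href="http://www.retailmenot.com/" class=l>keep</a>
<a href="http://www.example.com/skip">no class</a>
<img src="http://www.example.com/logo.png" class=l>
HTML

my $p = HTML::LinkExtor->new;
$p->parse( $html );
$p->eof;

# The URL lives under a different attribute name for each tag
# (hypothetical helper table, not part of the module).
my %url_attr = ( a => 'href', img => 'src' );

my @urls;
for my $link ( $p->links ) {
    my( $tag, %hash ) = @$link;
    next unless exists $url_attr{$tag};
    next unless defined $hash{class} && $hash{class} eq 'l';
    push @urls, $hash{ $url_attr{$tag} };
}

print "$_\n" for @urls;
```

Note that assigning to a tag's entry replaces whatever attributes HTML::Tagset listed for it by default, so include every attribute you still care about.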

If you'd rather have a callback routine, that's not so hard either. You can watch the links as the parser runs into them:

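use strict;
use warnings;
use HTML::LinkExtor;

# Collect the class attribute along with href for <a> tags.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
# Called once per link-bearing tag with ( $tag, attribute => value, ... ).
my $callback = sub {
    my( $tag, %hash ) = @_;
    no warnings 'uninitialized';
    push @links, $hash{href} if $hash{class} eq 'foo';
    };

my $p = HTML::LinkExtor->new( $callback );
$p->parse( do { local $/; <DATA> } );  # reads the script's __DATA__ section
$p->eof;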
brian d foy
Great module, but it seems I don't need every href, only the hrefs that have 'class=l' after the link...
soulSurfer2010
HTML::LinkExtor can help you figure out what other attributes are set.
brian d foy
@brian d foy, HTML::LinkExtor only collects attributes that are URLs. It doesn't collect the `class` attribute. You'd have to subclass it to ignore links with the wrong `class`.
cjm
Sorry that I didn't have time earlier to produce an example. No need for a subclass.
brian d foy
"You don't need a regex at all." And you should not use a regex at all. It has been said that if any phrase ought to be emblazoned on the top of SO, "you cannot use regular expressions to parse XML" is certainly one of them.
Jon Purdy
amazing!! amazing!! and again! amazing! Thanks a lot!
soulSurfer2010