ansaurus

Question

How do I extract links from HTML with a Perl regex?

Answer 1

+10 A:

Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link extracting Perl modules. You don't need a regex at all.

Here's a short example. You don't have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect:

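#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Collect class alongside href for <a> tags
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my $p = HTML::LinkExtor->new;
$p->parse( do { local $/; <> } );  # slurp all of the input
$p->eof;

# Each link is [ $tag, attr => value, ... ]; keep those with class "foo"
my @links = grep {
    my( $tag, %hash ) = @$_;
    no warnings 'uninitialized';
    $hash{class} eq 'foo';
    } $p->links;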

If you need to collect URLs for any other tags, you make similar adjustments.
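
As a rough sketch of such an adjustment, here is the same idea applied to img tags. The class name 'thumb' and the sample HTML are made up for illustration; only the %HTML::Tagset::linkElements tweak comes from the example above. Note that the hash is global, so the change affects every HTML::LinkExtor object in the program.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Ask the extractor to report class alongside src for <img> tags.
$HTML::Tagset::linkElements{'img'} = [ qw( src class ) ];

# Hypothetical input, for illustration only.
my $html = <<'HTML';
<img src="a.png" class="thumb">
<img src="b.png" class="banner">
<img src="c.png">
HTML

my $p = HTML::LinkExtor->new;
$p->parse( $html );
$p->eof;

# Each link is [ $tag, attr => value, ... ]; keep the src of matching images.
my @images;
for my $link ( $p->links ) {
    my( $tag, %attr ) = @$link;
    next unless $tag eq 'img';
    next unless defined $attr{class} and $attr{class} eq 'thumb';
    push @images, $attr{src};
}

print "$_\n" for @images;   # prints a.png
```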

If you'd rather have a callback routine, that's not so hard either. You can watch the links as the parser runs into them:

use HTML::LinkExtor;

$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
my $callback = sub {
    my( $tag, %hash ) = @_;
    no warnings 'uninitialized';
    push @links, $hash{href} if $hash{class} eq 'foo';
    };

my $p = HTML::LinkExtor->new( $callback );
$p->parse( do { local $/; <DATA> } );  # reads the HTML from this file's __DATA__ section
$p->eof;
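
To try the callback version without a __DATA__ section, something like the following should work. It is only a sketch: the chunks of HTML are invented, and feeding them to parse one at a time is just a way to mimic input arriving piecemeal, so the callback sees each link as the parser reaches it.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
my $callback = sub {
    my( $tag, %attr ) = @_;
    return unless $tag eq 'a';
    return unless defined $attr{href} and defined $attr{class};
    push @links, $attr{href} if $attr{class} eq 'foo';
};

my $p = HTML::LinkExtor->new( $callback );

# Hypothetical input, handed to the parser one chunk at a time.
my @chunks = (
    qq{<a href="/one" class="foo">one</a>\n},
    qq{<a href="/two" class="bar">two</a>\n},
    qq{<a href="/three" class="foo">three</a>\n},
);
$p->parse( $_ ) for @chunks;
$p->eof;

print "$_\n" for @links;   # prints /one then /three
```

When a callback is supplied, the links method no longer accumulates anything, so the callback has to store whatever it wants to keep, as @links does here.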

brian d foy 2010-09-25 00:41:46

Great module, but it seems I don't need every href, only the hrefs that have 'class=l' after the link...

soulSurfer2010 2010-09-25 00:46:15

HTML::LinkExtor can help you figure out what other attributes are set.

brian d foy 2010-09-25 00:48:43

@brian d foy, HTML::LinkExtor only collects attributes that are URLs. It doesn't collect the `class` attribute. You'd have to subclass it to ignore links with the wrong `class`.

cjm 2010-09-25 03:28:12

Sorry that I didn't have time earlier to produce an example. No need for a subclass.

brian d foy 2010-09-25 04:12:16

"You don't need a regex at all." And you should not use a regex at all. It has been said that if any phrase ought to be emblazoned on the top of SO, "you cannot use regular expressions to parse XML" is certainly one of them.

Jon Purdy 2010-09-25 06:07:06

amazing!! amazing!! and again! amazing! Thanks a lot!

soulSurfer2010 2010-09-25 13:21:02
