ansaurus

Question

Parse HTML Page For Links With Regex Using Perl

Answer 1

+9 A:

Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.

Or, consider the following simple example:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

my @hrefs;

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}

print "$_\n" for @hrefs;

__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath 
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" 
class="bnone">Death Becomes Her
                (1992)</a>

Output:

/en/subtitles/3586224/death-becomes-her-en

Sinan Ünür 2009-11-05 21:03:41

Metaphysical +1 (I'm out of upvotes).

Chris Lutz 2009-11-05 21:08:58

Thank you, Chris. Been in that situation many times ;-)

Sinan Ünür 2009-11-05 21:25:37

Answer 2

A:

URLs like the one in your example can be matched with a regular expression like

($url) = /href=\"([^\"]+)\"/i

If the HTML ever uses single quotes (or no quotes) around a URL, or if a URL ever contains a quote character, this pattern will either miss the link or capture a truncated URL. For this reason, you will get some answers telling you not to use regular expressions to parse HTML. Heed them, but carry on if you are confident that the input will be well behaved.
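To illustrate the quoting caveat, here is a minimal sketch of a more tolerant pattern that accepts double-quoted, single-quoted, and unquoted href values. The helper name extract_hrefs and the sample markup are made up for illustration and are not part of the original answer. It is still only a heuristic: it will also match href text inside comments or scripts, and attributes such as data-href.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original answer):
# collect href values using a regex that tolerates the three quoting styles.
# This is a heuristic, not an HTML parser.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    while ( $html =~ m{
            \bhref \s* = \s*
            (?: "([^"]*)"       # double-quoted value
              | '([^']*)'       # single-quoted value
              | ([^\s>"']+)     # unquoted value
            )
        }gix ) {
        # Only one of the three capture groups is defined per match.
        my ($value) = grep { defined } $1, $2, $3;
        push @hrefs, $value;
    }
    return @hrefs;
}

# Made-up sample input covering each quoting style.
my $sample = q{<a href="/en/subtitles/1/a">A</a> }
           . q{<a href='/en/subtitles/2/b'>B</a> }
           . q{<a href=/en/subtitles/3/c>C</a>};

print "$_\n" for extract_hrefs($sample);
```

Once the input needs this much special-casing, an HTML parser is probably the more reliable choice.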

mobrule 2009-11-05 21:08:44

Answer 3

+4 A:

Don't use regexes. Use an HTML parser like HTML::TreeBuilder.

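use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new; # empty tree
$tree->parse_file($file_name);
$tree->elementify;

# Collect the href of every <a> element, skipping anchors that have none
my @links = grep { defined } map { $_->attr('href') } $tree->look_down( _tag => 'a' );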

$tree = $tree->delete;

# Do stuff with links array

daotoad 2009-11-05 21:08:53

+1 It works but for files of unknown size, I tend to avoid building the whole document tree.

Sinan Ünür 2009-11-05 21:14:41

HTML::TreeBuilder has handled all my needs with ease. I've never needed to parse huge HTML files that needed one of the line-by-line type parsers, so I can't just dash such a script off. However, if you've got huge files, you definitely don't want to hold the whole tree in RAM.

daotoad 2009-11-06 07:18:17
