I previously asked how to extract the text and URL of every link on a web page in Groovy. However, I'm now rewriting my app in Perl to take advantage of the CPAN libraries.

If the page contained these links:

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>

The output would be:

 Google, http://www.google.com
 Apple, http://www.apple.com

What is the best way to do this in Perl?

+11  A: 

Have a look at HTML::LinkExtor, part of the HTML::Parser package.
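
As a rough illustration of what HTML::LinkExtor returns, here is a minimal sketch. The helper name extract_urls and the sample HTML are assumptions made for this example, not part of the module. Each entry from links() is an array reference holding the tag name followed by attribute/URL pairs, so the text inside the <a> tag is not available this way.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: collect the href of every <a> tag in an HTML string.
sub extract_urls {
    my ($html) = @_;
    my $parser = HTML::LinkExtor->new;    # no callback, so links are stored
    $parser->parse($html);
    $parser->eof;

    my @urls;
    for my $link ( $parser->links ) {
        # Each entry looks like [ $tag, $attr_name => $url, ... ].
        my ( $tag, %attrs ) = @$link;
        push @urls, $attrs{href} if $tag eq 'a' && defined $attrs{href};
    }
    return @urls;
}

# Sample input taken from the question.
my $sample = <<'HTML';
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
HTML

print "$_\n" for extract_urls($sample);
```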

Sherm Pendley
Unfortunately, HTML::LinkExtor can't give you the text inside the <a> tag, which he says he's interested in. It only tells you the tag name and its attributes.
cjm
+1  A: 

HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.

converter42
+4  A: 

I like using pQuery for things like this...

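use feature 'say';    # say() needs Perl 5.10 or later
use pQuery;

# Print "text, url" for every <a> element on the page.
pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);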

Also check out the earlier Stack Overflow question "Emulation of lex like functionality in Perl or Python" for similar answers.

/I3az/

draegtun
+24  A: 

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.

use WWW::Mechanize;

# Fetch the page at $some_url, then print "text, url" for each link on it.
my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

Mech is basically a browser in an object.

Andy Lester
I took the liberty of changing the print statement to include the link text, as requested by melling.
cjm
+2  A: 

Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.

Andy recommended WWW::Mechanize. That's probably the best solution.

If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
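
A minimal sketch of the HTML::TreeBuilder approach follows. The helper name extract_links and the sample HTML are assumptions for illustration. It walks the tree with look_down, keeps every <a> element that has an href, and pairs the element's text with its URL.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns a list of [text, href] pairs, one per
# <a> element that carries a non-empty href attribute.
sub extract_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @links;
    for my $anchor ( $tree->look_down( _tag => 'a', href => qr/./ ) ) {
        push @links, [ $anchor->as_text, $anchor->attr('href') ];
    }

    $tree->delete;    # release the tree explicitly
    return @links;
}

# Sample input taken from the question.
my $page = <<'HTML';
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
HTML

printf "%s, %s\n", @$_ for extract_links($page);
```

With the sample input, this should print one "text, url" line per link, in the format the question asks for.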

cjm
+2  A: 

Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.

ysth
+2  A: 

Another way to do this is to use XPath to query the parsed HTML. This is useful in more complex cases, such as extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.

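  use HTML::TreeBuilder::XPath;

  # $c holds the HTML content to parse.
  my $tree = HTML::TreeBuilder::XPath->new_from_content($c);
  # Select every <area> inside the <map> named "map1".
  my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
  while ( my $node = $nodes->shift ) {
    my $t = $node->attr('title');    # title attribute of each matched node
  }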
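
For the "links in a div with a specific class" case mentioned above, a sketch along these lines might work. The helper name links_in_div, the class name nav, and the sample HTML are all assumptions for illustration. Note that the XPath test @class="..." only matches when the class attribute is exactly that string.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper: [text, href] pairs for every link inside a <div>
# whose class attribute is exactly $class.
sub links_in_div {
    my ( $html, $class ) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

    # In list context, findnodes returns the matching elements directly.
    my @links =
        map { [ $_->as_text, $_->attr('href') ] }
        $tree->findnodes(qq{//div[\@class="$class"]//a[\@href]});

    $tree->delete;    # release the tree explicitly
    return @links;
}

# Sample input: only the first link sits inside the "nav" div.
my $doc = <<'HTML';
<div class="nav"><a href="http://www.google.com">Google</a></div>
<div class="footer"><a href="http://www.apple.com">Apple</a></div>
HTML

printf "%s, %s\n", @$_ for links_in_div( $doc, 'nav' );
```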
Alexandr Ciornii