ansaurus

Question

How can I extract the HREF value from an HTML link?

Answer 1

+8 A:

Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.

Michael Carman 2009-05-29 16:14:33

+1. Generally using a regexp here is the wrong solution.

Alex Feinman 2009-05-29 16:43:12

Even better is either HTML::LinkExtor or HTML::SimpleLinkExtor. You don't have to handle the parsing details directly.

brian d foy 2009-05-29 21:10:38
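As a rough sketch of the module-based approach suggested above, the following uses HTML::LinkExtor to collect href values without writing any regex. The sample HTML string and the variable names are assumptions made up for illustration; no base URL is passed, so the links come back as plain strings exactly as written in the markup.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Sample input (hypothetical); note the mixed-case tag and attribute.
my $html = '<p><a href="http://example.com/a">A</a> <A HREF="/b">B</A></p>';

# With no callback, the extractor accumulates links for a later ->links call.
my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

my @hrefs;
for my $link ($extor->links) {
    # Each entry is [ $tag, $attr_name => $url, ... ] with lowercased names.
    my ($tag, %attrs) = @$link;
    push @hrefs, $attrs{href} if $tag eq 'a' && defined $attrs{href};
}

print "$_\n" for @hrefs;
```

Because the parser normalizes tag and attribute names to lowercase, the same loop handles `HREF`, `href`, and any other casing.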

Answer 2

A:

When trying to match against HTML (or XML) with a regex, you have to be careful about using .*. You rarely want .* because the star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class [^"]* to match all characters except a double quote; this matches up to the closing quote. You may also want to use something similar for matching up to the closing angle bracket. Try this:

/HREF="([^"]*)"[^>]*>/i

That should match much more consistently.

Stephan 2009-05-29 16:33:25
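To see the pattern above in action, here is a minimal sketch that applies it to a made-up line of HTML (the sample string is an assumption, not from the original question). The /g flag is added here so that a list-context match returns every captured value on the line.

```perl
use strict;
use warnings;

# Hypothetical input: two links on one line, with different attribute casing.
my $html = '<a class="x" HREF="http://example.com/one" title="t">one</a> '
         . '<a href="/two">two</a>';

# [^"]* stops at the closing quote; [^>]* skips any remaining attributes.
my @hrefs = $html =~ /HREF="([^"]*)"[^>]*>/gi;

print "$_\n" for @hrefs;
```

Note that this pattern only handles double-quoted values. It will miss links written as href='...' or with unquoted values, which is one reason the parser modules are the safer choice.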

Answer 3

+8 A:

To answer your specific question about why your regex isn't working: you're using .*, which is "greedy", meaning that by default it will match as much as it can. Alternatives would be to use the non-greedy form, .*?, or to be more exact about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.

But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.

Chris Simmons 2009-05-29 16:38:18
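The difference between the greedy, non-greedy, and character-class forms described above can be seen in a small sketch. The input string and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical input with two links, so the greedy match has room to overrun.
my $html = '<a href="/first" class="nav">one</a> <a href="/second">two</a>';

# Greedy: .* runs to the end of the line, then backs up to the LAST quote.
my ($greedy) = $html =~ /href="(.*)"/i;

# Non-greedy: .*? stops at the first quote that lets the match succeed.
my ($lazy) = $html =~ /href="(.*?)"/i;

# Negated class: [^"]* cannot cross a quote at all.
my ($negated) = $html =~ /href="([^"]*)"/i;

print "greedy:  $greedy\n";
print "lazy:    $lazy\n";
print "negated: $negated\n";
```

Here the greedy capture swallows everything from `/first` through `/second`, while the other two forms both capture just `/first`.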

Answer 4

A:

If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath

It gives you the power of XPath in non-well-formed HTML.

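use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Parse the file into a tree; the parser tolerates HTML that is not well-formed.
my $tree = HTML::TreeBuilder::XPath->new_from_file('D:\Archive\XPath.pm.htm');
# Collect the href attribute of every <a> directly inside <div class="noprint">.
my @hrefs = $tree->findvalues('//div[@class="noprint"]/a/@href');
print "The links are: ", join(',', @hrefs), "\n";
$tree->delete;    # release the tree when done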

Axeman 2009-05-29 21:34:14
