My text file contains 2 lines:

<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>

In my Perl script, I have:

my $String =~ /.*(HREF=")(.*)(">)/;
print "$2";

and my output is the following:

Output 1: yahoo.com.jp

Output 2: ><HR>

What I am trying to achieve is to have my Perl script automatically extract the string inside the <A HREF="">.

As I am very new to regex, I want to ask whether my regex is badly formed. If so, can someone suggest a way to make it cleaner?

Secondly, I do not know why my second output is "><HR>". I expected the second line to be skipped, since it does not contain HREF=". Obviously I am very wrong.

Thanks for the help.

+8  A: 

Using regular expressions to parse HTML works just often enough to lull you into a false sense of security. You can get away with it for simple cases where you control the input but you're better off using something like HTML::Parser instead.

Michael Carman
+1. Generally using a regexp here is the wrong solution.
Alex Feinman
Even better is either HTML::LinkExtor or HTML::SimpleLinkExtor. You don't have to handle the parsing details directly.
brian d foy
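
For illustration, here is a minimal sketch of the HTML::LinkExtor approach applied to the two lines from the question. It assumes the module is installed from CPAN; extract_links is a hypothetical helper name, not part of the module.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the href value of every <a> tag in $html.
sub extract_links {
    my ($html) = @_;

    # With no callback, the parser collects links internally.
    my $parser = HTML::LinkExtor->new;
    $parser->parse($html);
    $parser->eof;

    my @hrefs;
    for my $link ( $parser->links ) {
        # Each entry is [ tag, attribute => url, ... ] with lowercased names.
        my ( $tag, %attrs ) = @$link;
        push @hrefs, $attrs{href} if $tag eq 'a' && defined $attrs{href};
    }
    return @hrefs;
}

my $html = <<'END_HTML';
<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>
</PRE><HR>
END_HTML

print "$_\n" for extract_links($html);
```

The IMG tag is also reported as a link (through its SRC attribute), which is why the loop filters on the tag name.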
A: 

When trying to match against HTML (or XML) with a regex, you have to be careful about using .*. You rarely want .* because star is a greedy quantifier that will match as far as it can. As Gumbo showed, use the character class [^"]* to match all characters except a quote; this matches up to the closing quote. You may also want to use something similar for matching the angle bracket. Try this:

/HREF="([^"]*)"[^>]*>/i

That should match much more consistently.
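
As a rough sketch of how that pattern could be used on the two lines from the question: extract_href is a hypothetical helper name, and the input lines are copied from the question. Guarding the print on a successful match means the line without HREF prints nothing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the HREF value from one line,
# or undef if the line contains no HREF attribute.
sub extract_href {
    my ($line) = @_;
    return $line =~ /HREF="([^"]*)"[^>]*>/i ? $1 : undef;
}

my @lines = (
    '<IMG SRC="/icons/folder.gif" ALT="[DIR]"> <A HREF="yahoo.com.jp/">yahoo.com.jp/</A>',
    '</PRE><HR>',
);

for my $line (@lines) {
    my $href = extract_href($line);
    # Print only when the match succeeded, so the second line is skipped.
    print "$href\n" if defined $href;
}
```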

Stephan
+8  A: 

To answer your specific question about why your regex isn't working: you're using .*, which is "greedy", meaning that by default it will match as much as it can. Alternatives would be to use the non-greedy form, .*?, or to be a bit more exact about what you're trying to match. For instance, [^"]* will match anything that's not a double quote, which seems to be what you're looking for.
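
To make the difference concrete, here is a small sketch comparing the three forms. The input line with two links is a made-up example, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical input: two links on one line.
my $line = '<A HREF="a.example/">a</A> <A HREF="b.example/">b</A>';

# Greedy: .* runs to the end of the line, then backs up to the LAST ">.
my ($greedy) = $line =~ /HREF="(.*)">/;

# Non-greedy: .*? stops at the FIRST "> it can reach.
my ($lazy) = $line =~ /HREF="(.*?)">/;

# Negated character class: cannot cross a double quote at all.
my ($negated) = $line =~ /HREF="([^"]*)"/;

print "greedy:  $greedy\n";    # a.example/">a</A> <A HREF="b.example/
print "lazy:    $lazy\n";      # a.example/
print "negated: $negated\n";   # a.example/
```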

But yes, the other posters are correct - using regular expressions to do anything non-trivial in HTML parsing is a recipe for disaster. Technically you can do it properly, especially in Perl 5.10 (which has more advanced regular expression features), but it's usually not worth the headache.

Chris Simmons
A: 

If I may, I'd like to suggest the simplest way of doing this (it may not be the fastest or lightest-weight way): HTML::TreeBuilder::XPath

It gives you the power of XPath in non-well-formed HTML.

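use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Parse a local HTML file; the parser tolerates non-well-formed markup.
my $tree = HTML::TreeBuilder::XPath->new_from_file( 'D:\Archive\XPath.pm.htm' );
# Collect the href attribute of every <a> directly inside <div class="noprint">.
my @hrefs = $tree->findvalues( '//div[@class="noprint"]/a/@href' );
print "The links are: ", join( ',', @hrefs ), "\n";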
Axeman