Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen, such as <a class="external" href="...">) rather than putting together fragile solutions using s///.
If you are going to take the s/// route, at least be honest: do depend on HREF attributes being all upper case instead of putting up an illusion of flexibility.
Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('a')) {
        my $href = $token->get_attr('href');
        # Unwrap only links that point somewhere other than an in-page anchor
        if (defined $href and $href !~ /^#/) {
            # Print the link text in place of the whole <a>...</a> element
            print $parser->get_trimmed_text('/a');
            $parser->get_token; # discard </a>
            next;
        }
    }
    # Everything else passes through unchanged
    print $token->as_is;
}
__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>
<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>
Output:
C:\Temp> hjk
<a HREF="#FN1" name="01">1</a>
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered
<p>Maybe you did not consider click here >>>
either</p>
NB: The regex-based solution you checked as "correct" breaks if the files that are linked to have the .html extension rather than .htm. Given that, I find your concern with not relying on the upper case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else and you should rely on the all caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.
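To make the fragility concrete, here is a minimal sketch using core Perl only. The substitution in strip_links is a hypothetical pattern I made up for illustration (it is not the regex from the accepted answer), and the sub name and sample strings are assumptions as well. It unwraps a link only when href is the first attribute and its value ends in .htm, which is the kind of assumption that tends to sneak into s/// solutions.

```perl
#!/usr/bin/perl
use strict; use warnings;

# Hypothetical substitution, NOT the one from the accepted answer:
# unwraps <a href="...htm">text</a> only when href is the first
# attribute and its value ends in ".htm".
sub strip_links {
    my ($html) = @_;
    $html =~ s{<a\s+href="[^"]*\.htm"\s*>(.*?)</a>}{$1}gis;
    return $html;
}

my @samples = (
    '<a href="155.htm">No. 155</a>',                                # the case the pattern was written for
    '<a href="test.html">click here</a>',                           # .html instead of .htm
    '<a class="external" href="http://example.com">An example</a>', # extra attribute before href
);

for my $sample (@samples) {
    my $result = strip_links($sample);
    # A sample that comes back unchanged was silently skipped
    my $status = $result eq $sample ? 'missed:' : 'ok:';
    print "$status $result\n";
}
```

Only the first sample is unwrapped; the other two come back untouched, with no warning that anything was skipped. The parser-based loop above handles all three shapes because it looks at the href attribute itself rather than at the surrounding text.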