I am trying to remove external links from an HTML document while keeping the anchors, but I'm not having much luck. The following regex

$html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;

will match from the beginning of an anchor tag to the end of an external link tag, e.g.

<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->

so I end up with nothing instead of

<a HREF="#FN1" name="01">1</a>
some other html

It just so happens that all anchors have their href attribute in uppercase, so I know I can do a case-sensitive match, but I don't want to rely on that always being the case in the future.

Is there something I can change so it only matches the one a tag?

A: 

How about using [^<>"] instead of . inside the href value, so the match cannot run past the end of the tag? In other words,

$html =~ s{<a href="[^<>"]+?\.htm">(.+?)</a>}{$1}sig;

There are cases where this may fail, but perhaps it is enough for your particular problem.
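
As a minimal sketch of both points, the snippet below wraps the substitution in a helper (the name strip_htm_links is hypothetical, introduced only for illustration) and runs it on the sample from the question, then on a link with one extra attribute. The character class cannot cross a quote or an angle bracket, so the named anchor survives, but the second input is left untouched because the pattern expects href to come directly after `<a `.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution above.
sub strip_htm_links {
    my ($html) = @_;
    # [^<>"]+? cannot run past the closing quote of the href value,
    # so the match can no longer start in one tag and end in another.
    $html =~ s{<a href="[^<>"]+?\.htm">(.+?)</a>}{$1}sig;
    return $html;
}

my $sample = <<'HTML';
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a>
HTML

# The anchor is kept; only the 155.htm link is unwrapped.
print strip_htm_links($sample);

# An attribute before href is enough to defeat the pattern:
# this link comes back unchanged.
print strip_htm_links(qq{<a class="external" href="155.htm">No. 155</a>\n});
```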

Kinopiko
That works (and makes so much sense). Thanks.
Mark
Curse the wonky implementation of markdown in comments keeping me from writing an example of how this breaks ;)
hobbs
You don't say. Well, look now, at what it says: "There are cases where this may fail".
Kinopiko
This doesn't handle other attributes, attributes in other orders, using anything other than " for quoting, extra space at the end of <a >, and many other things. It always seems nice to start this way, but maintenance is perpetual and never-ending as you encounter new data. Do it correctly from the start and don't think about it again.
brian d foy
@hobbs: add your objection as an answer--clearly explain that it is an objection, and it won't work as a comment--you'll get at least one upvote from me, and hopefully if you explain it clearly enough, nobody will downvote you.
Axeman
A: 

Why not just only remove links for which the href attribute doesn't begin with a pound sign? Something like this:

$html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;
Amber
Doesn't handle bare links--I know, bare links are gross, and you'll never find them in HTML I write or write a generator for, but they and single-quoted attributes fit the spec.
Axeman
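
The quoting gap described in the comment above can be reproduced with a short sketch. The helper name strip_non_fragment_links is hypothetical; it only wraps the substitution from this answer. Of the three spellings of the same link, only the double-quoted one is unwrapped, and the single-quoted and unquoted forms pass through unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from this answer.
sub strip_non_fragment_links {
    my ($html) = @_;
    $html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;
    return $html;
}

# Three ways of writing the same link.
for my $link (
    q{<a href="155.htm">double-quoted</a>},
    q{<a href='155.htm'>single-quoted</a>},
    q{<a href=155.htm>unquoted</a>},
) {
    print strip_non_fragment_links($link), "\n";
}
```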
+10  A: 

Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen such as <a class="external" href="...">) rather than putting together fragile solutions using s///.

If you are going to take the s/// route, at least be honest and depend on the href attributes being all upper case, instead of putting up an illusion of flexibility.

Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('a')) {
        my $href = $token->get_attr('href');
        if (defined $href and $href !~ /^#/) {
            print $parser->get_trimmed_text('/a');
            $parser->get_token; # discard </a>
            next;
        }
    }
    print $token->as_is;
}

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Output:

C:\Temp> hjk
<a HREF="#FN1" name="01">1</a>
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered

<p>Maybe you did not consider click here >>>
either</p>

NB: The regex-based solution you checked as "correct" breaks if the files that are linked to have the .html extension rather than .htm. Given that, I find your concern with not relying on the upper case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else and you should rely on the all caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.

Sinan Ünür
+1 for the example he might not have considered. That's what _should have been_ my argument against regexes in this case.
Chris Lutz
I don't see how this is straightforward. You have code which depends extensively on knowledge of Perl's reference system and there are several "magic numbers" in the code, with no explanation. Further, it relies on another module, and CPAN modules often lack clear documentation. Your solution is compact, but it is far from straightforward, unless one is a Perl expert.
Kinopiko
@Kinopiko: 1. It's *correct*, unlike your solution, which breaks in any number of situations. 2. Code should be readable by someone who is *competent*. References are not a steep barrier to entry. A complete understanding of references is much easier for a beginner to come by than a complete understanding of regexes. 3. I would prefer `HTML::TokeParser::Simple` for its more readable interface, but if you can't spend a moment looking at the docs, once again, you fail it.
hobbs
And 4. A module is used because once again *this is not a trivial problem*. If you treat it as a trivial problem you get a solution that is *wrong*, like the original poster's and like yours. A module suited to the task is almost guaranteed to be *less* buggy.
hobbs
I agree that the magic numbers are confusing. This is a property of HTML::TokeParser rather than the general "don't parse with regexes". Using XML::LibXML's implementation of the W3C DOM would have been clearer, but more verbose.
jrockway
My first instinct is usually HTML::TreeBuilder, but I appreciate the use of a streaming parser rather than DOM (or pseudo-DOM) where the problem allows. The other nice thing about TokeParser::Simple is that it makes round-tripping much easier :)
hobbs
@jrockway On the other hand, the docs are very clear on what those numbers are http://search.cpan.org/perldoc/HTML::TokeParser::Simple#DESCRIPTION
Sinan Ünür
"CPAN modules often lack clear documentation". And that's why you should never consider using a module if you can have a quick go with a regular expression? Wow!
innaM
Who cares if CPAN modules often do anything? It only matters what the CPAN module you need does.
brian d foy
+6  A: 

A bit more like a SAX type parser is HTML::Parser:

use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::Parser;
use List::Util qw<first>;

my $omitted;

sub tag_handler { 
    my ( $self, $tag_name, $text, $attr_hashref ) = @_;
    if ( $tag_name eq 'a' ) { 
        my $href = first {; defined } @$attr_hashref{ qw<href HREF> };
        # An <a> without an href (a plain named anchor) is never omitted.
        $omitted = defined $href && substr( $href, 0, 7 ) eq 'http://';
        return if $omitted;
    }
    print $text;
}

sub end_handler { 
    my $tag_name = shift;
    if ( $tag_name eq 'a' && $omitted ) { 
        $omitted = 0;   # a bareword "false" does not compile under strict
        return;
    }
    print shift;
}

my $parser
    = HTML::Parser->new( api_version => 3
                       , default_h   => [ sub { print shift; }, 'text' ]
                       , start_h     => [ \&tag_handler, 'self,tagname,text,attr' ]
                       , end_h       => [ \&end_handler, 'tagname,text' ]
                       );
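# Assumption: the file to process is named on the command line.
my $path_to_file = shift @ARGV or die "usage: $0 file.html\n";
$parser->parse_file( $path_to_file ) or die $OS_ERROR;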
Axeman
+1 BTW, see http://www.perlfoundation.org/perl5/index.cgi?pbp_module_recommendation_commentary on `Smart::Comments`. I am not sure if I feel that strongly, but I am in general not a fan of source filters.
Sinan Ünür
@Sinan Ünür: normally I remove debugging code from my finished answers. That's where Smart::Comments shines though, is debugging code.
Axeman
That's not bad, but eventually does rely on another regular expression. However, HTML::Parser will give you attributes and their values if you ask nicely.
innaM
@Manni: Agree--and I knew that it did that, but I didn't want to write a complicated pipe when I was altering the tag--but it makes a better solution, if I'm not writing out anything. I'm going to change it.
Axeman
A: 

Yet another solution. I love HTML::TreeBuilder and family.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file(\*DATA);
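foreach my $link ($root->find_by_tag_name('a')) {
    my $href = $link->attr('href');
    # Keep named anchors without an href and in-page (#fragment) links.
    next if !defined $href or $href =~ /^#/;
    # Splice the link's children into its place in the tree.
    $link->replace_with_content;
}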
print $root->as_HTML(undef, "\t");

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>
Leonardo Herrera