ansaurus

Question

How can I remove external links from HTML using Perl?

Answer 1

A:

How about using [^<>"] instead of .+? In other words,

$html =~ s{<a href="[^<>"]+?\.htm">(.+?)</a>}{$1}sig;

There are cases where this may fail, but perhaps it is enough for your particular problem.

Kinopiko 2009-10-21 00:18:06

That works (and makes so much sense). Thanks.

Mark 2009-10-21 00:33:56

Curse the wonky implementation of markdown in comments keeping me from writing an example of how this breaks ;)

hobbs 2009-10-21 03:32:10

You don't say. Well, look now, at what it says: "There are cases where this may fail".

Kinopiko 2009-10-21 03:40:57

This doesn't handle other attributes, attributes in other orders, using anything other than " for quoting, extra space at the end of <a >, and many other things. It always seems nice to start this way, but maintenance is perpetual and never-ending as you encounter new data. Do it correctly from the start and don't think about it again.

brian d foy 2009-10-21 11:14:28
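
For readers who want to see the breakage described in the comments above, here is a rough sketch using only core Perl. The substitution is the one from the answer; the sub name strip_htm_links and the sample inputs are made up for illustration, one per failure mode mentioned. Under those assumptions, only the first sample should come back stripped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The substitution from the answer, wrapped in a sub (hypothetical name).
sub strip_htm_links {
    my ($html) = @_;
    $html =~ s{<a href="[^<>"]+?\.htm">(.+?)</a>}{$1}sig;
    return $html;
}

# Hypothetical inputs, one per failure mode named in the comments.
my @samples = (
    q{<a href="155.htm">plain</a>},                    # the shape the regex expects
    q{<a class="external" href="155.htm">extra attribute</a>},
    q{<a href='155.htm'>single quotes</a>},
    q{<a href="155.htm" >trailing space in tag</a>},
    q{<a href="155.html">html extension</a>},
);

for my $html (@samples) {
    my $out    = strip_htm_links($html);
    my $status = $out eq $html ? 'MISSED' : 'stripped';
    printf "%-9s %s\n", $status, $html;
}
```

Every sample is a valid link that a parser-based approach would treat the same way.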

@hobbs: add your objection as an answer--clearly explain that it is an objection, and it won't work as a comment--you'll get at least one upvote from me, and hopefully if you explain it clearly enough, nobody will downvote you.

Axeman 2009-10-21 19:59:34

Answer 2

A:

Why not just only remove links for which the href attribute doesn't begin with a pound sign? Something like this:

$html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;

Amber 2009-10-21 00:24:47

Doesn't handle bare links--I know, bare links are gross, and you'll never find them in HTML I write or write a generator for, but they and single-quoted attributes fit the spec.

Axeman 2009-10-21 20:01:35
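
A small sketch of the objection above, again with core Perl only. The substitution is copied from this answer (with the $ sigil on $html); the sub name strip_non_fragment_links and the sample links are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The substitution from this answer, wrapped in a sub (hypothetical name).
sub strip_non_fragment_links {
    my ($html) = @_;
    $html =~ s/<a href="[^#][^"]*?">(.+?)<\/a>/$1/sig;
    return $html;
}

# Hypothetical samples: the same external link with three quoting styles.
for my $html (
    q{<a href="http://example.com">double-quoted</a>},
    q{<a href=http://example.com>bare</a>},
    q{<a href='http://example.com'>single-quoted</a>},
) {
    my $out    = strip_non_fragment_links($html);
    my $status = $out eq $html ? 'unchanged' : 'stripped';
    print "$status: $out\n";
}
```

The pattern insists on a double quote right after href=, so the bare and single-quoted forms are left in place.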

Answer 3

+10 A:

Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen such as <a class="external" href="...">) rather than putting together fragile solutions using s///.

If you are going to take the s/// route, at least be honest, do depend on href attributes being all upper case instead of putting up an illusion of flexibility.

Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('a')) {
        my $href = $token->get_attr('href');
        if (defined $href and $href !~ /^#/) {
            print $parser->get_trimmed_text('/a');
            $parser->get_token; # discard </a>
            next;
        }
    }
    print $token->as_is;
}

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Output:

C:\Temp> hjk
<a HREF="#FN1" name="01">1</a>
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered

<p>Maybe you did not consider click here >>>
either</p>

NB: The regex-based solution you checked as "correct" breaks if the files that are linked to have the .html extension rather than .htm. Given that, I find your concern with not relying on the upper case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else and you should rely on the all caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.

Sinan Ünür 2009-10-21 01:24:27

+1 for the example he might not have considered. That's what _should have been_ my argument against regexes in this case.

Chris Lutz 2009-10-21 02:18:08

I don't see how this is straightforward. You have code which depends extensively on knowledge of Perl's reference system and there are several "magic numbers" in the code, with no explanation. Further, it relies on another module, and CPAN modules often lack clear documentation. Your solution is compact, but it is far from straightforward, unless one is a Perl expert.

Kinopiko 2009-10-21 02:39:20

@Kinopiko: 1. It's *correct*, unlike your solution, which breaks in any number of situations. 2. Code should be readable by someone who is *competent*. References are not a steep barrier to entry. A complete understanding of references is much easier for a beginner to come by than a complete understanding of regexes. 3. I would prefer `HTML::TokeParser::Simple` for its more readable interface, but if you can't spend a moment looking at the docs, once again, you fail it.

hobbs 2009-10-21 03:12:24

And 4. A module is used because once again *this is not a trivial problem*. If you treat it as a trivial problem you get a solution that is *wrong*, like the original poster's and like yours. A module suited to the task is almost guaranteed to be *less* buggy.

hobbs 2009-10-21 03:13:52

I agree that the magic numbers are confusing. This is a property of HTML::TokeParser rather than the general "don't parse with regexes". Using XML::LibXML's implementation of the W3C DOM would have been clearer, but more verbose.

jrockway 2009-10-21 03:20:57

My first instinct is usually HTML::TreeBuilder, but I appreciate the use of a streaming parser rather than DOM (or pseudo-DOM) where the problem allows. The other nice thing about TokeParser::Simple is that it makes round-tripping much easier :)

hobbs 2009-10-21 03:25:07

@jrockway On the other hand, the docs are very clear on what those numbers are http://search.cpan.org/perldoc/HTML::TokeParser::Simple#DESCRIPTION

Sinan Ünür 2009-10-21 03:25:54

"CPAN modules often lack clear documentation". And that's why you should never consider using a module if you can have a quick go with a regular expression? Wow!

innaM 2009-10-21 08:17:20

Who cares if CPAN modules often do anything? It only matters what the CPAN module you need does.

brian d foy 2009-10-21 11:10:52

Answer 4

+6 A:

A bit more like a SAX type parser is HTML::Parser:

use strict;
use warnings;

use English qw<$OS_ERROR>;
use HTML::Parser;
use List::Util qw<first>;

my $omitted;

sub tag_handler { 
    my ( $self, $tag_name, $text, $attr_hashref ) = @_;
    if ( $tag_name eq 'a' ) { 
        my $href = first {; defined } @$attr_hashref{ qw<href HREF> };
        # External if the href starts with http://; <a> tags without an href are kept
        $omitted = defined($href) && substr( $href, 0, 7 ) eq 'http://';
        return if $omitted;
    }
    print $text;
}

sub end_handler { 
    my $tag_name = shift;
    if ( $tag_name eq 'a' && $omitted ) { 
        $omitted = 0;    # Perl has no bareword 'false'; under strict it is a compile error
        return;
    }
    print shift;
}

my $parser
    = HTML::Parser->new( api_version => 3
                       , default_h   => [ sub { print shift; }, 'text' ]
                       , start_h     => [ \&tag_handler, 'self,tagname,text,attr' ]
                       , end_h       => [ \&end_handler, 'tagname,text' ]
                       );
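# Assumption: the input file is named on the command line
my $path_to_file = shift @ARGV or die "Usage: $0 file.html\n";
$parser->parse_file( $path_to_file ) or die $OS_ERROR;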

Axeman 2009-10-21 02:14:46

+1 BTW, see http://www.perlfoundation.org/perl5/index.cgi?pbp_module_recommendation_commentary on `Smart::Comments`. I am not sure if I feel that strongly, but I am in general not a fan of source filters.

Sinan Ünür 2009-10-21 03:33:05

@Sinan Ünür: normally I remove debugging code from my finished answers. That's where Smart::Comments shines though, is debugging code.

Axeman 2009-10-21 04:36:38

That's not bad, but eventually does rely on another regular expression. However, HTML::Parser will give you attributes and their values if you ask nicely.

innaM 2009-10-21 08:21:39

@Manni: Agree--and I knew that it did that, but I didn't want to write a complicated pipe when I was altering the tag--but it makes a better solution, if I'm not writing out anything. I'm going to change it.

Axeman 2009-10-21 18:09:37

Answer 5

A:

Yet another solution. I love HTML::TreeBuilder and family.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new_from_file(\*DATA);
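foreach my $link ($root->find_by_tag_name('a')) {
    my $href = $link->attr('href');
    # Leave anchors without an href and in-page (#fragment) links alone
    if (defined $href and $href !~ /^#/) {
        # replace_with_content takes no arguments: the element is
        # replaced in its parent by its own children
        $link->replace_with_content;
    }
}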
print $root->as_HTML(undef, "\t");

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Leonardo Herrera 2009-10-22 20:13:34
