ansaurus

Question

How can I remove all font tags that have nothing but whitespace in them, using Perl?

Answer 1

+6 A:

Obligatory warning: You shouldn't use regex to parse HTML.

While .*? is lazy, it does not mean it will avoid a match to be succeed. In $text1,

<font color="#008080"><span style="background: #ffffff"></span></font>

it is possible to match <font(.*?)>[\t\f ]*<\/font> by having .*? match the " color="#008080"></span". This is the shortest match that will cause the match to succeed.

If you want to stop at the first >, use

s|<font[^>]*>\s*</font>||gi
#      ^^^^

This assumes a > won't appear inside a  tag. (Example violation: 2">.)

KennyTM 2010-09-18 11:35:02

@downvoter: Please explain.

KennyTM 2010-09-18 11:45:57

+1 for the link to that question.

Donal Fellows 2010-09-18 12:14:06

Answer 2

+9 A:

If you are using XHTML, then this is pretty easy with XML::Twig:

use XML::Twig;

my $string = <<"HTML";
<?xml version="1.0"?>
<html>
<font color="#008080"><span style="background: #ffffff"></span></font>
<font color="#008080">    s</font>
<font></font>
</html>
HTML

use XML::Twig;
my $twig = XML::Twig->new( 
    pretty_print => 'nice',
    twig_handlers => {
        span => \&delete_empty,
        font => \&delete_empty,
        },
    );
$twig->parse( $string );

$twig->print;

sub delete_empty {
    my( $twig, $element ) = @_;

    $element->delete unless $element->text =~ /\S/;
    }

You can also use HTML::Tree, but I don't have time to write an example right now (and now that I do, Greg Bacon has already done it). I don't show you how to do this particular task in my Process HTML with a Perl Module article for InformIT, but most of the pieces are there.

brian d foy 2010-09-18 11:44:32

Answer 3

+2 A:

I am really fond of HTML::TokeParser::Simple. So, for variety, here is another way:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new( \*DATA );

while ( my $stag = $parser->get_token ) {
    if ( $stag->is_start_tag( qr/font|span/ ) ) {
        my $closer = '/' . $stag->get_tag;
        my $text   = $parser->get_text( $closer );
        my $etag   = $parser->get_tag( $closer );

        if ( $text =~ /\S/ ) {
            $text =~ s/^\s+//;
            $text =~ s/\s+\z//;
            print $stag->as_is, $text, $etag->as_is;
        }
    }
    else {
        print $stag->as_is;
    }
}


__DATA__
<h1>Test heading</h1>
<p>Here is some <b>sample</b> <em>text</em>: <span>one</span>
<font color="#008080"><span style="background: #ffffff"></span></font>
<font color="#008080">    s</font>
<font></font></p>

<h2>A subtitle</h2>
<p><q>this is a test</q>: ya ba da ba doo!</p>
</body>

Output:

<h1>Test heading</h1>
<p>Here is some <b>sample</b> <em>text</em>: <span>one</span>

<font color="#008080">s</font>
</p>

<h2>A subtitle</h2>
<p><q>this is a test</q>: ya ba da ba doo!</p>
</body>

Sinan Ünür 2010-09-18 13:18:22

Answer 4

+3 A:

The code below uses the HTML::TreeBuilder module, which is an appropriate tool for parsing HTML. Regular expressions are not.

#! /usr/bin/perl

use warnings;
use strict;

use HTML::TreeBuilder;

The test cases from your question:

my @cases = (
  '<font color="#008080"><span style="background: #ffffff"></span></font>',
  '<font color="#008080">    s</font>',
  '<font></font>',
);

We'll use is_empty as a predicate to the look_down method of HTML::Element to find  elements with no interesting content.

sub is_empty {
  my($font) = @_;

  my $is_interesting = sub {
    for ($_[0]->content_list) {
      return 1 if !ref($_) && /\S/;
    }
  };

  !$font->look_down($is_interesting);
}

Finally the main loop. For each fragment, we create a new HTML::TreeBuilder instance, remove empty  elements, and trim the first-level text content of those that remain.

foreach my $html (@cases) {
  my $tree = HTML::TreeBuilder->new_from_content($html);
  $_->detach for $tree->guts->look_down(_tag => "font", \&is_empty);

  my $result = "";
  if ($tree->guts) {
    foreach my $font ($tree->guts->look_down(_tag => "font")) {
      $font->attr($_,undef) for $font->all_external_attr_names;
      foreach my $text ($font->content_refs_list) {
        next if ref $$text;
        $$text =~ s/^\s+//;
        $$text =~ s/\s+$//;
      }
    }

    ($result = $tree->guts->as_HTML) =~ s/\s+$//;
  }

  print "$result\n";
}

Output:

    
<font>s</font>

Making two passes is sloppy. The code could be improved:

#! /usr/bin/perl

use warnings;
use strict;

use HTML::TreeBuilder;

my @cases = (
  '<font color="#008080"><span style="background: #ffffff"></span></font>',
  '<font color="#008080">    s</font>',
  '<font></font>',
);

foreach my $fragment (@cases) {
  my $tree = HTML::TreeBuilder->new_from_content($fragment);
  foreach my $font ($tree->guts->look_down(_tag => "font")) {
    $font->detach, next
      unless $font->look_down(sub { grep !ref && /\S/ => $_[0]->content_list });

    $font->attr($_,undef) for $font->all_external_attr_names;
    foreach my $text ($font->content_refs_list) {
      next if ref $$text;
      $$text =~ s/^\s+//;
      $$text =~ s/\s+$//;
    }
  }

  (my $cleaned = $tree->guts ? $tree->guts->as_HTML : "") =~ s/\s+$//;
  print $cleaned, "\n";
}

Greg Bacon 2010-09-18 13:50:12

I need to know: why do you do $font->detach?

cristi 2010-10-14 20:15:56

@cristi The code uses it to delete empty `font` elements. According to the [documentation for `HTML::Element`](http://search.cpan.org/~jfearn/HTML-Tree-4.0/lib/HTML/Element.pm), `$h->detach()` “unlinks `$h` from its parent, by setting its ‘parent’ attribute to `undef`, and by removing it from the content list of its parent (if it had one).”

Greg Bacon 2010-10-15 00:49:07

Thank you. An other question: why !ref in grep? this will also remove

cristi 2010-10-21 11:12:59

Thank you. An other question: why !ref in grep? this will also remove <IMG SRC="asdf">

cristi 2010-10-21 11:13:19

@cristi Maybe I misunderstood what you want. Should `$text1` (which has an empty `span`) from your original example remain or be deleted?

Greg Bacon 2010-10-21 12:41:24

Answer 5

A:

s/<font[^>]*>\s*<\/font>//gi;

The non-greedy .*? tries to consume a minimal number of characters, but it will take as many as necessary to achieve an overall match. If you replace it with [^>]*, the > has to match the very next >, or the match attempt fails.

Be aware that it's legal for > to appear in attribute values, so this solution isn't 100% guaranteed. Fortunately, the people who know about that little loophole are also sensible enough not to use it; I've never seen an angle bracket in an attribute value in the wild.

Alan Moore 2010-09-18 15:36:04

Would the downvoter care to explain why?

Alan Moore 2010-09-19 15:46:34

ansaurus

tags:

views:

answers:

How can I remove all font tags that have nothing but whitespace in them, using Perl?

related questions