tags:

views:

121

answers:

5

I'm trying to do a match in Perl, using the following regex:

s/<font(.*?)>[\t\f ]*<\/font>//gi;

What I want it to do is to remove all font tags that don't have anything inside.

Unfortunately, it doesn't stop after <font at the first > it will go until the > from before </font>.

Any pointers on what is wrong with the regex?

my $text1 = '<font color="#008080"><span style="background: #ffffff"></span></font>';
my $text2 = '<font color="#008080">    s</font>';
my $text2 = '<font></font>';
$text1 =~ s/<font(.*?)>[\t\f ]*<\/font>//gi;
$text2 =~ s/<font(.*?)>[\t\f ]*<\/font>//gi;
$text3 =~ s/<font(.*?)>[\t\f ]*<\/font>//gi;
print "$text1\n$text2\n$text3\n";

will print

 
<font>s</font>
 
+6  A: 

Obligatory warning: You shouldn't use regex to parse HTML.


While .*? is lazy, it does not mean it will avoid a match to be succeed. In $text1,

<font color="#008080"><span style="background: #ffffff"></span></font>

it is possible to match <font(.*?)>[\t\f ]*<\/font> by having .*? match the " color="#008080"><span style="background: #ffffff"></span". This is the shortest match that will cause the match to succeed.

If you want to stop at the first >, use

s|<font[^>]*>\s*</font>||gi
#      ^^^^

This assumes a > won't appear inside a <font> tag. (Example violation: <font onclick="return 1>2"></font>.)

KennyTM
@downvoter: Please explain.
KennyTM
+1 for the link to that question.
Donal Fellows
+9  A: 

If you are using XHTML, then this is pretty easy with XML::Twig:

use XML::Twig;

my $string = <<"HTML";
<?xml version="1.0"?>
<html>
<font color="#008080"><span style="background: #ffffff"></span></font>
<font color="#008080">    s</font>
<font></font>
</html>
HTML

use XML::Twig;
my $twig = XML::Twig->new( 
    pretty_print => 'nice',
    twig_handlers => {
        span => \&delete_empty,
        font => \&delete_empty,
        },
    );
$twig->parse( $string );

$twig->print;

sub delete_empty {
    my( $twig, $element ) = @_;

    $element->delete unless $element->text =~ /\S/;
    }

You can also use HTML::Tree, but I don't have time to write an example right now (and now that I do, Greg Bacon has already done it). I don't show you how to do this particular task in my Process HTML with a Perl Module article for InformIT, but most of the pieces are there.

brian d foy
+2  A: 

I am really fond of HTML::TokeParser::Simple. So, for variety, here is another way:

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new( \*DATA );

while ( my $stag = $parser->get_token ) {
    if ( $stag->is_start_tag( qr/font|span/ ) ) {
        my $closer = '/' . $stag->get_tag;
        my $text   = $parser->get_text( $closer );
        my $etag   = $parser->get_tag( $closer );

        if ( $text =~ /\S/ ) {
            $text =~ s/^\s+//;
            $text =~ s/\s+\z//;
            print $stag->as_is, $text, $etag->as_is;
        }
    }
    else {
        print $stag->as_is;
    }
}


__DATA__
<h1>Test heading</h1>
<p>Here is some <b>sample</b> <em>text</em>: <span>one</span>
<font color="#008080"><span style="background: #ffffff"></span></font>
<font color="#008080">    s</font>
<font></font></p>

<h2>A subtitle</h2>
<p><q>this is a test</q>: ya ba da ba doo!</p>
</body>

Output:

<h1>Test heading</h1>
<p>Here is some <b>sample</b> <em>text</em>: <span>one</span>

<font color="#008080">s</font>
</p>

<h2>A subtitle</h2>
<p><q>this is a test</q>: ya ba da ba doo!</p>
</body>
Sinan Ünür
+3  A: 

The code below uses the HTML::TreeBuilder module, which is an appropriate tool for parsing HTML. Regular expressions are not.

#! /usr/bin/perl

use warnings;
use strict;

use HTML::TreeBuilder;

The test cases from your question:

my @cases = (
  '<font color="#008080"><span style="background: #ffffff"></span></font>',
  '<font color="#008080">    s</font>',
  '<font></font>',
);

We'll use is_empty as a predicate to the look_down method of HTML::Element to find <font> elements with no interesting content.

sub is_empty {
  my($font) = @_;

  my $is_interesting = sub {
    for ($_[0]->content_list) {
      return 1 if !ref($_) && /\S/;
    }
  };

  !$font->look_down($is_interesting);
}

Finally the main loop. For each fragment, we create a new HTML::TreeBuilder instance, remove empty <font> elements, and trim the first-level text content of those that remain.

foreach my $html (@cases) {
  my $tree = HTML::TreeBuilder->new_from_content($html);
  $_->detach for $tree->guts->look_down(_tag => "font", \&is_empty);

  my $result = "";
  if ($tree->guts) {
    foreach my $font ($tree->guts->look_down(_tag => "font")) {
      $font->attr($_,undef) for $font->all_external_attr_names;
      foreach my $text ($font->content_refs_list) {
        next if ref $$text;
        $$text =~ s/^\s+//;
        $$text =~ s/\s+$//;
      }
    }

    ($result = $tree->guts->as_HTML) =~ s/\s+$//;
  }

  print "$result\n";
}

Output:

    
<font>s</font>

Making two passes is sloppy. The code could be improved:

#! /usr/bin/perl

use warnings;
use strict;

use HTML::TreeBuilder;

my @cases = (
  '<font color="#008080"><span style="background: #ffffff"></span></font>',
  '<font color="#008080">    s</font>',
  '<font></font>',
);

foreach my $fragment (@cases) {
  my $tree = HTML::TreeBuilder->new_from_content($fragment);
  foreach my $font ($tree->guts->look_down(_tag => "font")) {
    $font->detach, next
      unless $font->look_down(sub { grep !ref && /\S/ => $_[0]->content_list });

    $font->attr($_,undef) for $font->all_external_attr_names;
    foreach my $text ($font->content_refs_list) {
      next if ref $$text;
      $$text =~ s/^\s+//;
      $$text =~ s/\s+$//;
    }
  }

  (my $cleaned = $tree->guts ? $tree->guts->as_HTML : "") =~ s/\s+$//;
  print $cleaned, "\n";
}
Greg Bacon
I need to know: why do you do $font->detach?
cristi
@cristi The code uses it to delete empty `font` elements. According to the [documentation for `HTML::Element`](http://search.cpan.org/~jfearn/HTML-Tree-4.0/lib/HTML/Element.pm), `$h->detach()` “unlinks `$h` from its parent, by setting its ‘parent’ attribute to `undef`, and by removing it from the content list of its parent (if it had one).”
Greg Bacon
Thank you. An other question: why !ref in grep? this will also remove
cristi
Thank you. An other question: why !ref in grep? this will also remove <font><IMG SRC="asdf"> </font>
cristi
@cristi Maybe I misunderstood what you want. Should `$text1` (which has an empty `span`) from your original example remain or be deleted?
Greg Bacon
A: 
s/<font[^>]*>\s*<\/font>//gi;

The non-greedy .*? tries to consume a minimal number of characters, but it will take as many as necessary to achieve an overall match. If you replace it with [^>]*, the > has to match the very next >, or the match attempt fails.

Be aware that it's legal for > to appear in attribute values, so this solution isn't 100% guaranteed. Fortunately, the people who know about that little loophole are also sensible enough not to use it; I've never seen an angle bracket in an attribute value in the wild.

Alan Moore
Would the downvoter care to explain why?
Alan Moore