I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.

I'm having a problem with my regex not picking the correct pair of start and end tags to remove.

my $a = 'a <span>b <span style="color:red;">c</span> d</span>e';
$a =~ s/<span\s*>(.*?)<\/span>/$1/g;
print "$a\n";

returns

a b <span style="color:red;">c d</span>e

but I want it to return

a b <span style="color:red;">c</span> de

Help appreciated.

+4  A: 

Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:

a <span>b <span style="color:red;" title="being closed with </span>">c</span> de

Would you have thought of that?
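
To make that concrete, here is a small sketch that runs the substitution from the question over the input above. The `naive_strip` wrapper is a hypothetical name used only for this illustration; the pattern itself is the asker's.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the question.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<span\s*>(.*?)<\/span>/$1/g;
    return $html;
}

my $tricky = 'a <span>b <span style="color:red;" title="being closed with </span>">c</span> de';
print naive_strip($tricky), "\n";
# prints: a b <span style="color:red;" title="being closed with ">c</span> de
```

The non-greedy match stops at the first `</span>` it sees, which here sits inside the title attribute, so the attribute value is silently damaged.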

Use an XML processor instead.

Also see the Related Questions (to the right) for your question.

ax
It's true, but luckily I have a known (albeit large) pile of cruddy HTML to clean up, and that isn't one of Word's many evil tricks
Tim Abell
If it's about Word, did you see this: Cleaning Word's Nasty HTML, http://www.codinghorror.com/blog/archives/000485.html ?
ax
HTML is not valid as `<` and `>` must be escaped in attributes, also second span is not closed
tig
i don't think that `<` and `>` in attribute values *must* be escaped. do you have any reference?
ax
+6  A: 

Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).

This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:

(<span\b[^>]*>(?:(?!<\/?span\b).|(?1))*<\/span>)

See perlfaq 6.11.
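
A runnable sketch of the recursion idea, assuming Perl 5.10 or later. The helper name `outer_span` and the exact pattern are illustrative only; the pattern is a variant written so that each nesting level stops at its own end-tag.

```perl
use strict;
use warnings;
use 5.010;    # recursive subpatterns such as (?1) need Perl 5.10 or later

# Hypothetical helper: return the first outermost balanced span element
# found in $html, or undef if there is none.
sub outer_span {
    my ($html) = @_;
    my $balanced = qr{
        (                           # group 1: one balanced span element
            <span\b[^>]*>           # a start tag, with or without attributes
            (?:
                (?!</?span\b) .     # any character that does not begin a span tag
              |
                (?1)                # or a nested balanced span, by recursion
            )*
            </span>
        )
    }xs;
    return $html =~ $balanced ? $1 : undef;
}

print outer_span('a <span>b <span style="color:red;">c</span> d</span>e'), "\n";
# prints: <span>b <span style="color:red;">c</span> d</span>
```

This only finds a balanced element; it does nothing yet about which tags to strip.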

Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed <span> start-tags, allowing the </span> end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.
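
The counting described above is easier to express as a loop with a stack than as a single pattern. A minimal sketch, assuming well-behaved input: `strip_bare_spans` is a hypothetical helper, and because it still finds tags with a regex it will trip over a `>` or `</span>` inside an attribute value, comments, or script content.

```perl
use strict;
use warnings;

# Hypothetical helper: drop attribute-less <span> tags and their matching </span>.
sub strip_bare_spans {
    my ($html) = @_;
    my @keep;        # one flag per open span: true if its tags are kept
    my $out = '';
    # split with a capturing group returns the span tags as separate tokens
    for my $token ( split m{(</?span\b[^>]*>)}i, $html ) {
        if ( $token =~ m{\A<span\b([^>]*)>\z}i ) {
            my $attrs    = $1;
            my $has_attr = $attrs =~ /\S/ ? 1 : 0;
            push @keep, $has_attr;
            $out .= $token if $has_attr;
        }
        elsif ( $token =~ m{\A</span\b}i ) {
            # an unmatched </span> (empty stack) is passed through unchanged
            my $keep = @keep ? pop @keep : 1;
            $out .= $token if $keep;
        }
        else {
            $out .= $token;    # ordinary text and all other tags
        }
    }
    return $out;
}

print strip_bare_spans('a <span>b <span style="color:red;">c</span> d</span>e'), "\n";
# prints: a b <span style="color:red;">c</span> de
```

Each end-tag pops the flag pushed by the start-tag it closes, so it is kept or dropped together with that start-tag.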

You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.

bobince
+8  A: 

Try HTML::Parser:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

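# Stack of flags, one per open <span>: true if that span has attributes
# and its tags should therefore be printed.
my @print_span;
my $p = HTML::Parser->new(
  start_h   => [ sub {
    my ($text, $name, $attr) = @_;
    if ( $name eq 'span' ) {
      my $print_tag = %$attr;   # true if the tag has at least one attribute
      push @print_span, $print_tag;
      return if !$print_tag;    # drop a bare <span>
    }
    print $text;
  }, 'text,tagname,attr'],
  end_h => [ sub {
    my ($text, $name) = @_;
    if ( $name eq 'span' ) {
      # Drop the </span> if the start tag it closes was dropped.
      return if !pop @print_span;
    }
    print $text;
  }, 'text,tagname'],
  # Everything else (text, comments, declarations) passes through unchanged.
  default_h => [ sub { print shift }, 'text'],
);
$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;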

__END__
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a header</h1>
a <span>b <span style="color:red;">c</span> d</span>e
</body>
</html>
runrig
Thanks very much, that gives me lots to go on. I'll have to read up on how you've done this, and I've got some extra complications to handle, but hopefully this'll do the trick. The objective is actually to clean up Word crap that's been pasted into an ASP.NET file with master pages.
Tim Abell
A: 

With all your help I've published a script that does everything I need.

http://github.com/timabell/decrufter/

Tim Abell