ansaurus

Question

How can I remove unused, nested HTML span tags with a Perl regex?

Answer 1

+4 A:

Don't use regexps to process HTML (or XML). You never know what input you'll get. Consider this valid HTML:

a <span>b <span style="color:red;" title="being closed with </span>">c</span> de

Would you have thought of that?

Use an XML processor instead.
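To make the problem concrete, here is a minimal sketch of what a naive pattern does with that input. The helper name naive_span_tags and the pattern itself are illustrative assumptions, not something from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: collect what a naive pattern believes are <span> start tags.
sub naive_span_tags {
    my ($html) = @_;
    my @tags = $html =~ /(<span[^>]*>)/g;
    return @tags;
}

my $html = 'a <span>b <span style="color:red;" title="being closed with </span>">c</span> de';
print "$_\n" for naive_span_tags($html);
# The second "tag" printed is cut off at the > inside the title attribute.
```

The character class [^>]* has no notion of quoted attribute values, so it stops at the first > it sees, even when that > sits inside a quoted string.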

Also see the Related Questions (to the right) for your question.

ax 2009-03-20 17:54:33

It's true, but luckily I have a known (albeit large) pile of cruddy HTML to clean up, and that isn't one of Word's many evil tricks.

Tim Abell 2009-03-23 14:27:26

If it's about Word, did you see this: Cleaning Word's Nasty HTML, http://www.codinghorror.com/blog/archives/000485.html ?

ax 2009-03-23 14:37:30

That HTML is not valid, as `<` and `>` must be escaped in attributes; also, the second span is not closed.

tig 2010-02-04 16:25:03

I don't think that `<` and `>` in attribute values *must* be escaped. Do you have a reference?

ax 2010-02-04 21:43:24

Answer 2

+6 A:

Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).

This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:

(<span[^>]*>(?:(?!<\/?span\b).|(?1))*<\/span>)

See perlfaq 6.11.
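As a rough sketch of the recursion idea (needs Perl 5.10 or later), the following finds the outermost balanced spans in a string. The pattern is my own formulation, written with /x for readability, and the names $balanced_span and outermost_spans are made up for illustration:

```perl
use strict;
use warnings;

# Assumed pattern (not from the answer): group 1 matches one balanced span,
# recursing into itself via (?1) whenever a nested start tag appears.
my $balanced_span = qr{
    (
        <span\b[^>]*>            # a start tag, with or without attributes
        (?:
            (?! </?span\b ) .    # any character that does not begin a span tag
          |
            (?1)                 # or a complete nested span
        )*
        </span>
    )
}xs;

# Hypothetical helper: return every outermost balanced span in the input.
sub outermost_spans {
    my ($html) = @_;
    my @found;
    while ( $html =~ /$balanced_span/g ) {
        push @found, $1;
    }
    return @found;
}

print "$_\n" for outermost_spans('a <span>b <span style="color:red;">c</span> d</span>e');
```

This only pairs start and end tags. It is still fooled by input such as a </span> inside an attribute value, and it does nothing to decide which spans to keep.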

Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed <span> start-tags, allowing the </span> end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.

You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.

bobince 2009-03-20 17:57:36

Answer 3

+8 A:

Try HTML::Parser:

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

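# Stack with one flag per open <span>: true if the start tag was printed.
my @print_span;
my $p = HTML::Parser->new(
  start_h   => [ sub {
    my ($text, $name, $attr) = @_;
    if ( $name eq 'span' ) {
      # Keep a span only if it carries at least one attribute.
      my $print_tag = scalar keys %$attr;
      push @print_span, $print_tag;
      return if !$print_tag;
    }
    print $text;
  }, 'text,tagname,attr'],
  end_h => [ sub {
    my ($text, $name) = @_;
    if ( $name eq 'span' ) {
      # Drop the </span> whose matching start tag was dropped.
      return if !pop @print_span;
    }
    print $text;
  }, 'text,tagname'],
  # Everything else (text, comments, declarations) passes through unchanged.
  default_h => [ sub { print shift }, 'text'],
);
$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;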

__END__
<html>
<head>
<title>This is a title</title>
</head>
<body>
<h1>This is a header</h1>
a <span>b <span style="color:red;">c</span> d</span>e
</body>
</html>

runrig 2009-03-20 18:41:38

Thanks very much, that gives me lots to go on. I'll have to read up on how you've done this, and I've got some extra complications to handle, but hopefully this'll do the trick. The objective is actually to clean up Word crap that's been pasted into an ASP.NET file with master pages.

Tim Abell 2009-03-23 13:52:48

Answer 4

A:

With all your help I've published a script that does everything I need.

http://github.com/timabell/decrufter/

Tim Abell 2009-03-23 16:29:54
