I need to match and remove all tags except <p> tags using a regular expression in Perl. I have the following:

<\\??(?!p).+?>

But this still matches the closing </p> tag. Any hint on how to keep it from matching the closing tag as well?

Note, this is being performed on xhtml.

+1  A: 

Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:

/<\/?[^p][^>]*>/

EDIT:

But that won't match a <pre> or <param> tag, unfortunately.

This, perhaps?

/<\/?(?!p>|p )[^>]+>/

That should cover <p> tags that have attributes, too.

Brian Warshaw
A: 

Try this, it should work:

/<\/?([^p](\s.+?)?|..+?)>/

Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).

/EDIT: I've added the ability to handle attributes in p tags.

Konrad Rudolph
+2  A: 

Since HTML is not a regular language, I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure Perl must have some off-the-shelf libraries for manipulating HTML.

Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of Perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that, as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.

So as I say, I don't really think regexps are the right tool for the job.

DrPizza
+1  A: 

Since HTML is not a regular language

HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

Konrad Rudolph
A: 

You should probably also remove any attributes on the <p> tag, since someone bad could do something like:

<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>

The easiest way to do this is to use the regexes people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
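
A minimal sketch of that replacement, assuming the markup sits in a variable called $html (a name made up for this example) and that no attribute value contains a literal > character:

```perl
use strict;
use warnings;

# Hypothetical input; $html is an illustrative name, not from the question.
my $html = q{<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>};

# Replace any opening <p ...> tag that carries attributes with a bare <p>.
# Requiring whitespace right after the "p" leaves <pre>, <param>, etc. alone.
(my $clean = $html) =~ s{<p\s[^>]*>}{<p>}gi;

print "$clean\n";   # prints: <p>Clickable text</p>
```

This only covers the attribute problem on <p> itself; the other tags still have to be removed separately.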

Vegard Larsen
+1  A: 

I came up with this:

<(?!\/?p(?=>|\s.*>))\/?.*?>

m{
<         # match open angle bracket
(?!       # negative lookahead (not matching and not consuming)
  \/?     # 0 or 1 /
  p       # p
  (?=     # positive lookahead (matching and not consuming)
    >     # > - no attributes
    |     # or
    \s    # whitespace
    .*    # anything up to
    >     # close angle bracket - with attributes
  )       # close positive lookahead
)         # close negative lookahead
          # if we have got this far then we are not looking at
          # a p tag or closing p tag,
          # with or without attributes
\/?       # optional close tag symbol (/)
.*?       # and anything up to
>         # first closing angle bracket
}x

This now leaves p tags (with or without attributes) and closing p tags alone, but still matches pre and similar tags, with or without attributes.

It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.

Xetius
+3  A: 

Not sure why you want to do this. Regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the like)... but here is a regex to match HTML tags that aren't <p></p>:

(<[^pP].*?>|</[^pP]>)

Verbose:

(
    <               # < opening tag
        [^pP].*?    # a non-p character, then non-greedy anything
    >               # > closing tag
|                   #   ....or....
    </              # </
        [^pP]       # a non-p tag
    >               # >
)
dbr
+20  A: 

If you insist on using a regex, something like this will work in most cases:

# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

Explanation:

s{
  <             # opening angled bracket
  (?>/?)        # ratchet past optional / 
  (?:
    [^pP]       # non-p tag
    |           # ...or...
    [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
  )
  [^>]*         # everything until closing angled bracket
  >             # closing angled bracket
 }{}gx; # replace with nothing, globally
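
A quick check of the substitution above against a made-up input string (the sample markup and the variable names are assumptions for illustration only):

```perl
use strict;
use warnings;

# Hypothetical sample input covering the cases discussed in this thread.
my $html = q{<div><p class="x">Keep <b>this</b></p><pre>code</pre><P>and this</P></div>};

# Same substitution as above, applied to a copy so the input is preserved.
(my $stripped = $html) =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

print "$stripped\n";
# prints: <p class="x">Keep this</p>code<P>and this</P>
```

The <pre> tags are removed while <p>, <P>, and their closing tags survive, attributes included.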

But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:

use strict;
use warnings;

use HTML::TokeParser;

my $parser = HTML::TokeParser->new('/some/file.html')
  or die "Could not open /some/file.html - $!";

while(my $t = $parser->get_token)
{
  # Skip start or end tags that are not "p" tags
  next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');

  # Print everything else normally (see HTML::TokeParser docs for explanation)
  if($t->[0] eq 'T')
  {
    print $t->[1];
  }
  else
  {
    print $t->[-1];
  }
}

HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
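
As a rough sketch of that wrapping, assuming the input arrives as a string and the result should come back as a string rather than being printed (strip_except_p is a made-up name, not something HTML::Parser provides):

```perl
use strict;
use warnings;
use HTML::TokeParser;

# strip_except_p is a hypothetical helper name, not part of HTML::Parser.
# It takes the markup as a string and returns the filtered markup.
sub strip_except_p {
    my ($html) = @_;

    # A reference to a scalar tells HTML::TokeParser to parse that string.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";

    my $out = '';
    while (my $t = $parser->get_token) {
        my $type = $t->[0];

        # Drop start and end tags other than "p".
        next if ($type eq 'S' || $type eq 'E') && lc $t->[1] ne 'p';

        # Text tokens keep their text in slot 1; every other token type
        # keeps its original source text in the last slot.
        $out .= $type eq 'T' ? $t->[1] : $t->[-1];
    }
    return $out;
}

print strip_except_p('<div><p class="x">Hi <b>there</b></p></div>'), "\n";
# prints: <p class="x">Hi there</p>
```

Returning a string keeps the caller free to print it, write it to a file, or process it further.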

John Siracusa
Save yourself even further headache and use the excellent HTML::TokeParser::Simple module. :-)
Aristotle Pagaltzis
+1  A: 

You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.

Kibbee
+11  A: 

In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).

For example, this:

<HTML /
  <HEAD /
    <TITLE / > /
    <P / >

is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

It is semantically equivalent to

<html>
  <head>
    <title>
      &gt;
    </title>
  </head>
  <body>
    <p>
      &gt;
    </p>
  </body>
</html>

But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.

Jörg W Mittag
Wow. I didn't believe you, but I ran it through the W3 validator with an HTML 4.01 Strict doctype, and it validates. It throws up warnings, but wow.
eyelidlessness
A: 

HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

That may be true, but your regexp is not evidence of that, since it doesn't work. Other expressions posted to the thread seem to fare better, but they're not regular expressions any more. And they still don't cope with the full range of legal HTML encodings. Regexps just aren't the right tool for the job. Use a proper SGML or HTML parser.

DrPizza
+1  A: 

The original regex can be made to work with very little effort:

 <(?>/?)(?!p).+?>

The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
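
The difference can be seen with a small test on the closing tag alone (a sketch; the variable names are made up for the example):

```perl
use strict;
use warnings;

my $tag = '</p>';

# Without the atomic group, /? can give the slash back, so (?!p) is then
# tested against "/" instead of "p" and the closing tag matches after all.
my $plain  = $tag =~ m{</?(?!p).+?>}    ? 1 : 0;

# With (?>...), the slash stays consumed and (?!p) fails on the "p".
my $atomic = $tag =~ m{<(?>/?)(?!p).+?>} ? 1 : 0;

print "plain: $plain, atomic: $atomic\n";   # prints: plain: 1, atomic: 0
```

Note that (?!p) rejects every tag whose name starts with p, so <pre> is left in place by this pattern as well.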

(That said, I agree that parsing HTML with regexes is generally not the way to go.)

moritz
+1  A: 

Hello,

I used Xetius's regex and it works fine, except for some Flex-generated tags which can be:

with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:

<(?!\/?p(?=>|\s?.*>))\/?.*?>

I'm using it to clear tags from Flex-generated HTML text, so I also added more excepted tags:

<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
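
One thing to watch, as far as I can tell: with \s made optional, the lookahead succeeds for any tag whose name merely begins with one of the listed names, so <body> or <pre> would be kept too. A possible variant, sketched here with made-up variable names and sample markup, ends the tag name with a word boundary instead:

```perl
use strict;
use warnings;

# Tags to keep; the list mirrors the one in the pattern above.
my @keep  = qw(p a b i u br);
my $names = join '|', map { quotemeta($_) } @keep;

# \b demands that the tag name ends right after the alternation, so "b"
# does not also protect <body> and "p" does not also protect <pre>.
my $strip = qr{<(?!/?(?:$names)\b)[^>]+>}i;

# Hypothetical sample input.
my $html = '<body><p>One<br/><b>two</b> <span>three</span></p><pre>x</pre></body>';
(my $clean = $html) =~ s/$strip//g;

print "$clean\n";   # prints: <p>One<br/><b>two</b> three</p>x
```

Building the alternation from a list makes it easy to change which tags are excepted without touching the pattern itself.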
y_nk