ansaurus

Question

Regex to match all HTML tags except <p> and </p>

Answer 1

+1 A:

Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:

/<\/?[^p][^>]*>/

EDIT:

But that won't match a <pre> or <param> tag, unfortunately.

This, perhaps?

/<\/?(?!p>|p )[^>]+>/

That should cover <p> tags that have attributes, too.
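
A quick way to see the difference between the two patterns is to run them over a few opening tags. The sketch below uses Python's re module rather than Perl (my assumption; the lookahead syntax is the same in both), and the sample tags are made up for illustration:

```python
import re

# The two patterns from this answer, without the surrounding / delimiters.
FIRST = re.compile(r'</?[^p][^>]*>')
SECOND = re.compile(r'</?(?!p>|p )[^>]+>')

# Made-up opening tags to probe both patterns.
samples = ['<b>', '<pre>', '<param name="x">', '<p>', '<p class="intro">']

for tag in samples:
    # Columns: tag, matched by FIRST, matched by SECOND
    print(tag, bool(FIRST.fullmatch(tag)), bool(SECOND.fullmatch(tag)))
```

On these samples the first pattern skips every opening tag whose name starts with p, while the second skips only <p> and <p ...>.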

Brian Warshaw 2008-08-27 10:45:36

Answer 2

A:

Try this, it should work:

/<\/?([^p](\s.+?)?|..+?)>/

Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).

/EDIT: I've added the ability to handle attributes in p tags.

Konrad Rudolph 2008-08-27 10:47:17

Answer 3

+2 A:

Since HTML is not a regular language, I would not expect a regular expression to do a very good job of matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure Perl must have some off-the-shelf libraries for manipulating HTML.

Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of Perl's regexp syntax, so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that, as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.

So as I say, I don't really think regexps are the right tool for the job.

DrPizza 2008-08-27 10:53:29

Answer 4

+1 A:

Since HTML is not a regular language

HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

Konrad Rudolph 2008-08-27 10:54:58

Answer 5

A:

You should probably also remove any attributes on the tag, since someone bad could do something like:

<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>

The easiest way to do this is to use the regex people suggest here to search for <p> tags with attributes, and replace them with tags without attributes. Just to be on the safe side.
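
As a rough sketch of that replacement step (written in Python's re module rather than Perl, and with a helper name, strip_p_attributes, that I made up for illustration):

```python
import re

# Matches an opening <p ...> tag; \b keeps <pre>, <param> and friends out of it.
P_WITH_ATTRS = re.compile(r'<p\b[^>]*>', re.IGNORECASE)

def strip_p_attributes(html):
    """Replace every opening p tag, with or without attributes, by a bare <p>."""
    return P_WITH_ATTRS.sub('<p>', html)

print(strip_p_attributes('<p onclick="alert(1)">Clickable text</p>'))
```

This assumes no attribute value contains a literal >; if one does, the match stops early, which is one more reason to prefer a real parser for untrusted input.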

Vegard Larsen 2008-08-27 11:13:39

Answer 6

+1 A:

I came up with this:

<(?!\/?p(?=>|\s.*>))\/?.*?>

m{
<           # Match open angle bracket
(?!         # Negative lookahead (not matching and not consuming)
    \/?     # 0 or 1 /
    p       # p
    (?=     # Positive lookahead (matching and not consuming)
        >   # > - no attributes
        |   # or
        \s  # whitespace
        .*  # anything up to
        >   # close angle bracket - with attributes
    )       # close positive lookahead
)           # close negative lookahead
            # if we have got this far then we don't match
            # a p tag or closing p tag
            # with or without attributes
\/?         # optional close tag symbol (/)
.*?         # and anything up to
>           # first closing angle bracket
}x

This now skips p tags, with or without attributes, and closing p tags, but still matches pre and similar tags, with or without attributes.

It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
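
For anyone who wants to try the pattern outside Perl, here is a minimal sketch in Python's re module (my choice of language, not the original poster's). The sample string and the name NON_P_TAG are invented for the example:

```python
import re

NON_P_TAG = re.compile(r'''
    <                   # opening angle bracket
    (?!                 # give up here if this is a p tag:
        /?p             #   optional slash, then p
        (?= > | \s.*> ) #   followed by > or by whitespace and a later >
    )
    /?                  # optional slash of a closing tag
    .*?                 # tag name and attributes, as few characters as possible
    >                   # closing angle bracket
''', re.VERBOSE)

html = '<p class="a">Hello <b>world</b></p><pre>x</pre>'
print(NON_P_TAG.findall(html))  # the tags the pattern matches
print(NON_P_TAG.sub('', html))  # the text with those tags removed
```

On this sample the pattern picks out the b and pre tags and leaves both p tags alone.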

Xetius 2008-08-27 11:26:12

Answer 7

+3 A:

Not sure why you want to do this. Regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the like)... but here is a regex to match HTML tags that aren't <p>:

(<[^pP].*?>|</[^pP]>)

Verbose:

(
    <               # < opening tag
        [^pP].*?    # a non-p character, then non-greedy anything
    >               # > closing tag
|                   #   ....or....
    </              # </
        [^pP]       # a non-p tag
    >               # >
)

dbr 2008-08-27 12:17:14

Answer 8

+20 A:

If you insist on using a regex, something like this will work in most cases:

# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

Explanation:

s{
  <             # opening angled bracket
  (?>/?)        # ratchet past optional / 
  (?:
    [^pP]       # non-p tag
    |           # ...or...
    [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
  )
  [^>]*         # everything until closing angled bracket
  >             # closing angled bracket
 }{}gx; # replace with nothing, globally

But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:

use strict;

use HTML::TokeParser;

my $parser = HTML::TokeParser->new('/some/file.html')
  or die "Could not open /some/file.html - $!";

while(my $t = $parser->get_token)
{
  # Skip start or end tags that are not "p" tags
  next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');

  # Print everything else normally (see HTML::TokeParser docs for explanation)
  if($t->[0] eq 'T')
  {
    print $t->[1];
  }
  else
  {
    print $t->[-1];
  }
}

HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
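
The same token-by-token idea carries over to other languages. As a rough analogue only (not a port of the Perl above), here is a sketch using Python's standard html.parser module. The class and function names are hypothetical, and unlike the Perl loop this version drops comments and declarations instead of printing them:

```python
from html.parser import HTMLParser

class KeepOnlyP(HTMLParser):
    """Re-emit a document, dropping every tag except <p> and </p>."""

    def __init__(self):
        # convert_charrefs=False so entities such as &gt; pass through untouched
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':  # tag names arrive lower-cased
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag == 'p':  # self-closing form, e.g. <p/>
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'p':
            self.out.append('</p>')

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&' + name + ';')

    def handle_charref(self, name):
        self.out.append('&#' + name + ';')

def strip_non_p(html):
    parser = KeepOnlyP()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

print(strip_non_p('<div><p class="x">Hi <b>there</b> &amp; bye</p></div>'))
```

Collecting the output in a list rather than printing directly is what makes the destination configurable, as suggested above.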

John Siracusa 2008-08-27 12:31:35

Save yourself even further headache and use the excellent HTML::TokeParser::Simple module. :-)

Aristotle Pagaltzis 2008-09-19 12:37:16

Answer 9

+1 A:

You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.

Kibbee 2008-08-27 13:11:04

Answer 10

+11 A:

In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).

For example, this:

<HTML /
  <HEAD /
    <TITLE / > /
    <P / >

is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

It is semantically equivalent to

<html>
  <head>
    <title>
      &gt;
    </title>
  </head>
  <body>
    <p>
      &gt;
    </p>
  </body>
</html>

But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.

Jörg W Mittag 2008-08-27 14:01:27

Wow. I didn't believe you, but I ran it through the W3 validator with an HTML 4.01 Strict doctype, and it validates. It throws up warnings, but wow.

eyelidlessness 2009-11-22 09:20:05

Answer 11

A:

HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

That may be true, but your regexp is not evidence of that, since it doesn't work. Other expressions posted to the thread seem to fare better, but they're not regular expressions any more. And they still don't cope with the full range of legal HTML encodings. Regexps just aren't the right tool for the job. Use a proper SGML or HTML parser.

DrPizza 2008-08-27 20:59:58

Answer 12

+1 A:

The original regex can be made to work with very little effort:

 <(?>/?)(?!p).+?>

The problem was that the /? (or \/?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it ensures that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.
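
To make the backtracking visible, here is a small sketch in Python's re module (my choice; the thread is about Perl). Python only gained (?>...) in version 3.11, so the sketch emulates the atomic group with the lookahead-plus-backreference idiom, which behaves the same way for this purpose. The names BACKTRACKING and ATOMIC are invented:

```python
import re

# Plain optional slash: on "</p>" the engine can retry with /? matching nothing,
# so (?!p) ends up looking at "/" instead of "p".
BACKTRACKING = re.compile(r'</?(?!p).+?>')

# Atomic-group emulation: the lookahead captures the optional slash and \1
# consumes it, so the slash cannot be given back afterwards.
ATOMIC = re.compile(r'<(?=(/?))\1(?!p).+?>')

tags = ['<b>', '</b>', '<p>', '</p>']
for tag in tags:
    # Columns: tag, matched without atomic group, matched with it
    print(tag, bool(BACKTRACKING.fullmatch(tag)), bool(ATOMIC.fullmatch(tag)))
```

Only the last row differs: without the atomic group, </p> is matched after all.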

(That said I agree that generally parsing HTML with regexes is not the way to go).

moritz 2008-09-19 09:26:46

Answer 13

+1 A:

Hello,

I used Xetius's regex and it works fine, except for some Flex-generated tags, which can be:

with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:

<(?!\/?p(?=>|\s?.*>))\/?.*?>

I'm using it to clear tags from Flex-generated HTML text, so I also added more excepted tags:

<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>
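
Here is a minimal sketch of using that last pattern to strip tags, written in Python's re module as an assumption (the original context is Flex). The helper name strip_tags and the sample strings are made up:

```python
import re

# The pattern above, unchanged apart from dropping the backslash before each /.
STRIP = re.compile(r'<(?!/?(p|a|b|i|u|br)(?=>|\s?.*>))/?.*?>')

def strip_tags(html):
    """Remove every tag the pattern matches and keep the rest as-is."""
    return STRIP.sub('', html)

print(strip_tags('<font size="2"><p>Hi<br/><b>x</b></p></font>'))
print(strip_tags('<img src="x"><span>y</span>'))
```

In this sketch the font and span tags are removed, while img survives because its name starts with i: with \s?.*> accepting any continuation, the exception list works as a set of prefixes. Whether that matters depends on the input.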

y_nk 2010-05-28 10:15:31
