I am looking for a regex that matches every HTML element except <p>...</p>, including the element's content. I am developing in ColdFusion.

There was an earlier post about matching tags except <p> and </p>, but I need to grab everything between the tags as well. For instance, the following should match in their entirety:

<a href="http://www.google.com">Google</a>

and

<em>Some text here</em>

but not

<p>Some text and tags here</p>

Any ideas on how to accomplish this?

+1  A: 

Sometimes it's easier to combine a regex with a little extra checking.

So \<.*?\>.*?\<\/.*?\> should match anything of the form <..>...</..>. You can then programmatically check that the <..> and </..> are not <p> and </p> respectively. It'd probably be easiest to check this if you group them, à la:

(\<.*?\>).*?(\<\/.*?\>)

then check to make sure $1 and $2 (or however you do backreferences in your environment) aren't the paragraph open and close tags.

.*? specifies minimal matching; I'm assuming your regex environment supports that.
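A rough sketch of that match-then-check idea, written in Python purely for illustration (the question targets ColdFusion, so the API will differ there). The helper names `tag_name` and `non_paragraph_elements` are hypothetical, and the `re.DOTALL` flag is an assumption added so that content may span line breaks:

```python
import re

# Pattern from the answer: group 1 is the first tag, group 2 the closing tag.
# re.DOTALL is an addition here so that "." also matches newlines.
ELEMENT = re.compile(r"(<.*?>).*?(</.*?>)", re.DOTALL)
TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")


def tag_name(tag):
    """Return the lowercased element name of a tag like '<a href=...>' or '</a>'."""
    m = TAG_NAME.match(tag)
    return m.group(1).lower() if m else ""


def non_paragraph_elements(html):
    """Hypothetical helper: keep matches whose tags are not <p> / </p>."""
    kept = []
    for m in ELEMENT.finditer(html):
        # The programmatic check: discard paragraph open/close tags.
        if tag_name(m.group(1)) == "p" or tag_name(m.group(2)) == "p":
            continue
        kept.append(m.group(0))
    return kept


sample = (
    '<a href="http://www.google.com">Google</a> and '
    "<em>Some text here</em> but not <p>Some text and tags here</p>"
)
print(non_paragraph_elements(sample))
```

Because the pattern pairs each opening tag with the nearest closing tag, nested or mismatched elements will be cut short, so this is only suitable for simple, flat markup.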

Arnshea
+7  A: 

Parsing HTML using regular expressions is very very hard and painful.

You're better off using some sort of DOM-based parser and finding the elements you need.
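For comparison, here is a minimal sketch of the parser-based route using Python's standard `html.parser` module (an event-based parser rather than a full DOM, and Python rather than ColdFusion, purely for illustration). The class name `NonParagraphCollector`, the helper `collect_non_paragraph`, and the abbreviated `VOID` set are assumptions made for this example:

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (abbreviated list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class NonParagraphCollector(HTMLParser):
    """Rebuilds the source of each outermost element that is not a <p>."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the elements currently being captured
        self.buffer = []  # source pieces of the element being captured
        self.found = []   # completed elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            if self.stack:
                self.buffer.append(self.get_starttag_text())
            return
        if tag == "p" and not self.stack:
            return  # skip paragraphs that are not inside a captured element
        self.stack.append(tag)
        self.buffer.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.stack or tag != self.stack[-1]:
            return
        self.stack.pop()
        self.buffer.append(f"</{tag}>")
        if not self.stack:  # outermost captured element just closed
            self.found.append("".join(self.buffer))
            self.buffer = []

    def handle_data(self, data):
        if self.stack:
            self.buffer.append(data)


def collect_non_paragraph(html):
    """Hypothetical helper wrapping the collector above."""
    parser = NonParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.found


print(collect_non_paragraph(
    '<a href="http://www.google.com">Google</a> and '
    "<em>Some text here</em> but not <p>Some text and tags here</p>"
))
```

Unlike the regex approaches, this handles nested elements of the same name, at the cost of reconstructing the markup (character references in the text are decoded, so the output is not always byte-identical to the input).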

Mark Biek
+1. Regex cannot parse HTML, as already mentioned in the 2000 questions already posted asking about parsing HTML with regex.
bobince
It's good. Eventually every single Google hit for "parse html regex" will point to one of these questions talking about why it's a bad idea.
Mark Biek
A: 

You haven't said what you're trying to do, but there's a good chance you're better off using the XmlParse function to create an XML DOM, and working on that instead.

Peter Boughton
A: 

Does this work? I only did a few checks on it, but it seems to:

Regex expr = new Regex(@"<([A-OQ-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", RegexOptions.IgnoreCase);

I just copied & pasted the C# code. The \1 backreference makes the closing tag match the opening tag's name, and the text between the tags is captured in group 2. You also need to turn off case sensitivity, via IgnoreCase, or -i, or whatever option your tool provides. If your tool doesn't offer one, you will have to spell out both cases, e.g. A-OQ-Za-oq-z. Just the regex:

<([A-OQ-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

Note this will not match standalone tags, and it also skips every tag whose name starts with P (such as <pre>), but it should get you started.
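To try the pattern outside C#, here is a small Python sketch (Python is used only for illustration). It runs the regex above as-is and, next to it, a variant that is not part of the original answer: a negative lookahead `(?!p\b)` that rejects only the exact tag name `p` instead of every name beginning with P. The names `ORIGINAL`, `LOOKAHEAD`, and `matches` are assumptions for this example, as is the `re.DOTALL` flag:

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# The regex from the answer: first letter of the tag name may be anything but P.
ORIGINAL = re.compile(r"<([A-OQ-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", FLAGS)

# Assumed variant: reject only a tag named exactly "p".
LOOKAHEAD = re.compile(r"<(?!p\b)([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", FLAGS)


def matches(pattern, html):
    """Return the full text of every element the pattern matches."""
    return [m.group(0) for m in pattern.finditer(html)]


html = "<em>Some text here</em> <pre>code</pre> <p>para</p>"
print(matches(ORIGINAL, html))   # misses <pre>, since its name starts with P
print(matches(LOOKAHEAD, html))  # keeps <pre>, still skips <p>
```

In both patterns `m.group(1)` holds the tag name and `m.group(2)` the content between the tags. Neither handles an element nested inside another element of the same name.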

Nick