ansaurus

Question

Regex to match all HTML tags and tag content except <p> and </p>

Answer 1

+1 A:

Sometimes it's easier to combine a regex with a little extra checking.

So \<.*?\>.*?\<\/.*?\> should match anything of the form <..>...</..>. You can then check programmatically that the <..> and </..> are not <p> and </p> respectively. This is probably easiest if you capture them in groups, à la:

(\<.*?\>).*?(\<\/.*?\>)

then check to make sure $1 and $2 (or however you do backreferences in your environment) aren't the paragraph open and close tags.

.*? specifies minimal matching; I'm assuming your regex environment supports that.
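
As a rough illustration of the "regex plus a little extra checking" idea, here is a minimal Python sketch. Python is only an assumption here, since the question does not name an environment. The names TAG_PAIR, PARAGRAPH_TAG and non_paragraph_matches are made up for the example, and the backslashes before <, > and / are dropped because Python's re module does not need them.

```python
import re

# The grouped pattern from above, without the optional backslashes.
# DOTALL lets ".*?" span line breaks.
TAG_PAIR = re.compile(r"(<.*?>).*?(</.*?>)", re.DOTALL)

# Matches <p>, <p class="x"> and </p>, but not <pre> or <param>.
PARAGRAPH_TAG = re.compile(r"</?p(\s[^>]*)?>", re.IGNORECASE)


def non_paragraph_matches(html):
    """Return matched spans whose captured tags are not paragraph tags."""
    kept = []
    for match in TAG_PAIR.finditer(html):
        opening, closing = match.group(1), match.group(2)
        if PARAGRAPH_TAG.fullmatch(opening) or PARAGRAPH_TAG.fullmatch(closing):
            continue  # skip anything that starts or ends with a paragraph tag
        kept.append(match.group(0))
    return kept


print(non_paragraph_matches("<b>bold</b> and <p>para</p> and <i>it</i>"))
# ['<b>bold</b>', '<i>it</i>']
```

Note that the pattern pairs each tag with the first closing tag that follows it, whatever its name, so nested markup such as <p>Hello <b>bold</b></p> is paired incorrectly and the sketch returns nothing for it.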

Arnshea 2009-03-20 20:38:18

Answer 2

+7 A:

Parsing HTML using regular expressions is very very hard and painful.

You're better off using some sort of DOM-based parser and finding the elements you need.
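
To give a feel for the parser route, here is a small sketch using Python's standard html.parser module. Python is an assumption, as the question names no language, and html.parser is event-based rather than a full DOM, so treat it as a stand-in for whatever parser your platform offers. NonParagraphCollector and collect_non_paragraph_elements are hypothetical names for this example.

```python
from html.parser import HTMLParser


class NonParagraphCollector(HTMLParser):
    """Record the tag name and text of every element that is not a <p>."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of (tag, text_parts) for open elements
        self.found = []          # (tag, text) pairs, in the order elements close

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            self.open_elements.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every non-<p> element that is currently open.
        for _, parts in self.open_elements:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag == "p":
            return
        # Close the nearest open element with this name, dropping any
        # unclosed elements (such as <br>) stacked above it.
        for index in range(len(self.open_elements) - 1, -1, -1):
            if self.open_elements[index][0] == tag:
                name, parts = self.open_elements[index]
                del self.open_elements[index:]
                self.found.append((name, "".join(parts)))
                break


def collect_non_paragraph_elements(html):
    parser = NonParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.found


print(collect_non_paragraph_elements(
    "<p>Hello <b>bold</b> text</p><div>block <i>it</i></div>"))
# [('b', 'bold'), ('i', 'it'), ('div', 'block it')]
```

Unlike a bare regex, the parser copes with attributes, nesting and mixed-case tag names without extra work.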

Mark Biek 2009-03-20 20:38:29

+1. Regex cannot parse HTML, as mentioned in the 2000 questions already posted asking about parsing HTML with regex.

bobince 2009-03-20 21:13:49

It's good. Eventually every single Google hit for "parse html regex" will point to one of these questions talking about why it's a bad idea.

Mark Biek 2009-03-20 23:22:32

Answer 3

A:

You haven't said what you're trying to do, but there's a good chance you're better off using the XmlParse function to create an XML DOM, and working on that instead.
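
The same idea can be sketched in Python, where xml.etree.ElementTree plays roughly the role that XmlParse plays here. This is an analogue chosen for illustration, not the XmlParse API itself, and non_paragraph_elements is a hypothetical helper name. It assumes the markup is well-formed XML; ordinary HTML often is not and would raise a parse error.

```python
import xml.etree.ElementTree as ET


def non_paragraph_elements(xml_text):
    """Parse well-formed markup and list (tag, text) for every non-<p> element."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed input
    return [
        (element.tag, "".join(element.itertext()))
        for element in root.iter()  # the root and all descendants, in document order
        if element.tag != "p"
    ]


print(non_paragraph_elements(
    "<div><p>Hello <b>bold</b></p><span>note</span></div>"))
# [('div', 'Hello boldnote'), ('b', 'bold'), ('span', 'note')]
```

Once the document is a tree, skipping <p> is a plain comparison on the element name rather than a pattern trick.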

Peter Boughton 2009-03-20 21:43:58

Answer 4

A:

Does this work? I only did a few checks on it, but it seems to:

Regex expr = new Regex(@"<([A-OQ-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", RegexOptions.IgnoreCase);

I just copied & pasted the C# code. The \1 backreference makes the closing tag match the name of the opening tag, and the second group captures everything between the tags. You also need to turn off case sensitivity: IgnoreCase, or -i, or whatever option your tool provides. If your tool doesn't have one, you will have to spell out both cases yourself, e.g. [A-OQ-Za-oq-z]. Just the regex:

<([A-OQ-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>

Note that this will not match standalone tags, but it should get you started.
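
For anyone not using C#, here is the same pattern tried out in Python, as a sketch only. PAIRED_TAG and find_non_p_elements are names invented for this example, and re.IGNORECASE stands in for RegexOptions.IgnoreCase.

```python
import re

# Same pattern as above. Group 1 is the tag name, group 2 the content
# between the opening tag and the matching closing tag.
PAIRED_TAG = re.compile(r"<([A-OQ-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)


def find_non_p_elements(html):
    """Return (tag name, inner content) for each paired tag the pattern accepts."""
    return PAIRED_TAG.findall(html)


sample = '<div class="x">block</div><p>para <b>bold</b></p><pre>code</pre>'
print(find_non_p_elements(sample))
# [('div', 'block'), ('b', 'bold')]
```

The sample shows two things to watch for. The <b> nested inside the paragraph is still matched, and <pre> is skipped along with <p>, because the first character class rejects every tag name that starts with P.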

Nick 2009-03-30 21:59:29
