Let's say I need to get the string inside some h1, h2, or h3 tags:

/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/

This works great if the user decides to take a sane approach to headers:

<h1>My Header</h1>

But knowing my users, they want bold, italic, underlined h1s, and they have that coding quagmire TinyMCE to help them do it. TinyMCE would output:

<h1><b><span style='text-decoration: underline'><i>My Hideous Header</i></span></b></h1>

So my question is:

How do I get the string inside an h1, h2, or h3, and then inside any number of other surrounding tags as well?

Thanks, Joe

+3  A: 
/<(h[1-3])[^>]*>(?:.*?>)?([^<]+)(?:<.*?)?<\/\1>/i

It will not be too hard to make cases that break it hideously, since (as I'm sure people will tell you) parsing HTML is a job for an HTML parser, not a regex, but it works for your given case and various similar ones.
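For comparison, here is a minimal sketch of the parser route, assuming Python and its standard-library `html.parser` module (the question does not name a language, so that choice is an assumption). The class name `HeaderText` and the helper `header_texts` are made up for illustration.

```python
from html.parser import HTMLParser


class HeaderText(HTMLParser):
    """Collect the text content of every h1, h2, and h3 element."""

    HEADERS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headers = []    # finished header strings
        self._parts = None   # text chunks of the header being read, else None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> look the same here.
        if tag in self.HEADERS and self._parts is None:
            self._parts = []

    def handle_data(self, data):
        # Text inside nested <b>, <span>, <i>, ... still arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADERS and self._parts is not None:
            self.headers.append("".join(self._parts).strip())
            self._parts = None


def header_texts(html):
    """Return the text of each h1-h3 header found in html."""
    parser = HeaderText()
    parser.feed(html)
    parser.close()
    return parser.headers
```

Calling `header_texts()` on the TinyMCE output from the question should give a one-element list containing `My Hideous Header`, however deeply the text is wrapped.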

chaos
+1, especially for the "don't use regex for this" comment
Simon Nickerson
+1 for the same reasons as simonn!
TrueWill
A: 

If you only want to capture the ultimately nested text you could just drop all tags inside the header tag with:

/<([hH][1-3]).*>(.*?)<.*\/\1>/

Untested, but I think it should work.

Swish
Nope. `(.*?)` is allowed to match nothing, and thanks to the greedy `.*` ahead of it, that's exactly what it does.
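A quick way to see this, sketched in Python (an assumed language; the back-reference is written as `\1`, which is what Python's `re` module expects). The helper name `captured_text` is hypothetical.

```python
import re

# The pattern from the answer above, with the back-reference spelled \1.
GREEDY = re.compile(r"<([hH][1-3]).*>(.*?)<.*/\1>")


def captured_text(html):
    """Return what the second group captures, or None if nothing matches."""
    m = GREEDY.search(html)
    return m.group(2) if m else None


# No nested tags: the only usable '>' is the one closing <h1>, so the lazy
# group has to grow until it reaches the closing tag.
print(repr(captured_text("<h1>My Header</h1>")))

# Nested tags: the greedy .* runs ahead to the '>' of the last inner closing
# tag, which leaves the lazy group free to match nothing at all.
print(repr(captured_text("<h1><b><i>My Hideous Header</i></b></h1>")))
```

The first call captures `My Header`; the second matches the whole header but captures an empty string.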
Alan Moore
+1  A: 

If you're in PHP, you can use your regex:

/<[hH][1-3][^>]*>(.*?)<\/[hH][1-3]>/

then pass the captured result through the strip_tags() function to get rid of all the insanity inside.

If you are not on PHP, you can pass the result through a regex replace that removes tags: something like replacing /<\/?[^>]+?>/ with an empty string.
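As a rough sketch of that two-step approach outside PHP, assuming Python: capture the header contents with the original regex, then delete anything that looks like a tag. The function name `header_strings` is made up, and the `re.DOTALL` flag is an addition (not in the original pattern) so that headers spanning several lines still match.

```python
import re

# Step 1: the regex from the question, capturing everything inside h1-h3.
HEADER = re.compile(r"<[hH][1-3][^>]*>(.*?)</[hH][1-3]>", re.DOTALL)

# Step 2: the tag-removing pattern suggested above.
TAG = re.compile(r"</?[^>]+?>")


def header_strings(html):
    """Return the text of each h1-h3 header with inner tags stripped."""
    return [TAG.sub("", inner) for inner in HEADER.findall(html)]
```

Like any regex approach, this will trip over things such as a `>` inside an attribute value, so treat it as good enough for TinyMCE-style output rather than for arbitrary HTML.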

Kamil Szot