I'm trying to capture certain parts of HTML using regular expressions, and I've come across a situation which I don't know how to resolve.
I've got an HTML fragment like this:
<span ...> .... <span ...> ... </span> ... </span>
so, a <span> element into which another <span> element is nested.
I've been successfully using the following regex (in PHP's preg_match() / preg_match_all()) to capture entire HTML elements:
@<sometag[^>]+>.*?</sometag>@
This captures a given starting tag and everything up to the first closing tag of the same type that follows it.
However, in the situation above, this would capture the starting <span> and everything up to the next closing </span> encountered, so what I get is this:
<span ...> .... <span ...> ... </span>
that is, the outer starting tag and everything up to the closing tag of the inner span, so the rest of the outer element is cut off. This, of course, is not what I want.
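Here is a minimal reproduction of the problem, with span standing in for sometag. The sample markup, the class names and the variable names are made up for illustration; the real FrontPage output is much messier.

```php
<?php
// Hypothetical sample markup: an outer <span> with an inner <span> nested in it.
$html = '<p><span class="outer">one <span class="inner">two</span> three</span></p>';

// The pattern from above, with "span" in place of "sometag".
// The lazy .*? stops at the first </span> it meets, which closes the inner span.
$pattern = '@<span[^>]+>.*?</span>@';

preg_match($pattern, $html, $matches);

// What the regex actually returns: the outer element cut off after the inner </span>.
$actual = $matches[0];

// What I am after: the complete outer element, including the nested span.
$wanted = '<span class="outer">one <span class="inner">two</span> three</span>';

echo $actual, "\n"; // prints: <span class="outer">one <span class="inner">two</span>
echo $wanted, "\n";
```

The first line printed is the truncated match, the second is the result I would like to get instead.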
What I really want is the outer <span> element complete with everything inside it, including the nested inner <span>.
Is there any practical way to achieve this?
Note: parsing the HTML with an XML parser is probably not an option. The HTML I'm working on is old, very broken HTML 4 generated by MS FrontPage, and I expect any parser would choke on it.
Thanks for any help!