I researched this quite a bit, but couldn't find a working example of how to match nested HTML tags with attributes. I know it is possible to match balanced/nested innermost tags without attributes (for example, a regex for <div> and </div> would be #<div\b[^>]*>(?:(?> [^<]+ ) |<(?!div\b[^>]*>))*?</div>#x).

However, I would like to see a regex pattern that finds an HTML tag pair with attributes.

Example: It basically should match

<div class="aaa"> **<div class="aaa">** <div> <div> </div> **</div>** </div>

and not

<div class="aaa"> **<div class="aaa">** <div> <div> **</div>** </div> </div>

Does anybody have any ideas?

For testing purposes we could use: http://www.lumadis.be/regex/test_regex.php


PS: Steven mentioned a solution on his blog (actually in a comment), but it doesn't work:

http://blog.stevenlevithan.com/archives/match-innermost-html-element

$regex = '/<div\b[^>]+?\bid\s*=\s*"MyID"[^>]*>(?:((?:[^<]++|<(?!\/?div\b[^>]*>))+)|(<div\b[^>]*>(?>(?1)|(?2))*<\/div>))?<\/div>/i';
A: 

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

And indeed, it is absolutely impossible. HTML has something unique, something magical, which is immune to RegEx.

Time Machine
*something magical, which is immune to RegEx* == XML, HTML, and friends are not regular languages
Daniel Brückner
It's bad enough having to see *links* to The Rant in every other question; copying it is going too far. It isn't *that* funny, and more to the point, it isn't helpful.
Alan Moore
Just to clarify: this is more of a theoretical discussion, just for fun. Of course, in real life I would use XPath or similar. I understand that "finite state" or "true" regexes cannot do this, but what about the PHP/PCRE flavor of regex, which is no longer really "classical" regex (for example, it even supports recursive patterns such as (?R))?
Dave
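
For comparison, here is a rough sketch of the balanced matching the question asks about, done without a recursive pattern. It is written in Python rather than PHP, and it swaps recursion for a plain depth counter: one regex finds the opening tag with the wanted attribute, a second regex walks over every later <div> or </div> tag, and the match ends where the depth returns to zero. The function name match_balanced_div and the sample markup are made up for illustration, and the sketch assumes that no <div> text appears inside comments, scripts, or attribute values.

```python
import re

# Matches any opening or closing <div> tag; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r'<(/?)div\b[^>]*>', re.IGNORECASE)


def match_balanced_div(html, start_pattern):
    """Return the text from the first opening tag matching start_pattern
    through its balanced </div>, or None if no balanced match exists.

    Hypothetical helper for illustration only, not a real HTML parser.
    """
    opening = re.search(start_pattern, html, re.IGNORECASE)
    if opening is None:
        return None
    depth = 1  # the opening tag we just matched
    # Scan only the div tags that come after the opening tag.
    for tag in DIV_TAG.finditer(html, opening.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return html[opening.start():tag.end()]
    return None  # ran out of input before the tags balanced


# Opening-tag pattern taken from the start of the regex quoted in the question.
START = r'<div\b[^>]+?\bid\s*=\s*"MyID"[^>]*>'

sample = '<p><div id="MyID" class="a"> x <div> y <div>z</div> </div> w </div> tail </div>'
print(match_balanced_div(sample, START))
```

The counter does the job that (?R) or (?1) would do in PCRE, so this says nothing about whether the PCRE pattern quoted in the question can be repaired; it only shows that the nesting itself is easy to track once the tags are tokenized.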