Pre-scriptum: I'm purely curious, and am aware of other perfectly suitable solutions to this, that lie outside the domain of regular expressions.

How do I match from an opening tag up to its closing tag when there may be nested, and perhaps identical, tags in between? Say I have the following in an HTML file:

<div class="nice">
    <a href="http://www.google.com">Hello</a>
    <div>World</div>
</div>

Let's say I want to comment that out via regex replace. One could do a simple

/(<div\sclass=\"nice\">(.*)<\/div>)/

But that would of course match until the VERY LAST closing div tag, rendering the code foul if the nice div is embedded inside another div. Making the quantifier non-greedy would render the code foul even more, matching only until the VERY FIRST closing div tag.
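
To make the two failure modes concrete, here is a minimal sketch in Python. The choice of Python's `re` module and the wrapping `outer` div are assumptions added for illustration, since the question does not name a language. It counts how many closing div tags each variant swallows; a correct match of the nice div would contain exactly two.

```python
import re

# Hypothetical sample: the "nice" div from the question, wrapped in an outer div.
HTML = (
    '<div id="outer">\n'
    '<div class="nice">\n'
    '    <a href="http://www.google.com">Hello</a>\n'
    '    <div>World</div>\n'
    '</div>\n'
    '</div>\n'
)

# re.DOTALL lets "." cross line breaks, which the multi-line sample needs.
greedy = re.search(r'<div\sclass="nice">(.*)</div>', HTML, re.DOTALL)
lazy = re.search(r'<div\sclass="nice">(.*?)</div>', HTML, re.DOTALL)


def count_closers(match):
    """Count how many </div> tags ended up inside the matched text."""
    return match.group(0).count('</div>')


print("greedy:", count_closers(greedy))
print("lazy:", count_closers(lazy))
```

With this sample the greedy pattern runs through all three closing tags and the lazy one stops after the first, so neither lands on the second, which is the one that actually closes the nice div.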

So any ideas? I've often thought about this and never found a solution. Is this impossible in regex, or am I just forgetting something simple? Isn't there some sort of "look-back" mechanism?

+2  A: 

The usual advice is not to use regexps for HTML, since HTML isn't regular. So attempts to parse it using regular expressions (especially for something rigorous like the above) are going to be fraught with difficulty.

Brian Agnew
The fact that HTML isn't regular does not necessarily imply that it cannot be parsed with the modern implementations of regexes which have surpassed the capabilities needed to parse a regular language, e.g. backreferencing.
oleks
+3  A: 

It's impossible in a regex.

Use an HTML parser instead, like Beautiful Soup, html5lib, Hpricot, or Nokogiri.
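
As a rough illustration of the parser route, here is a sketch using Python's standard-library `html.parser` rather than any of the libraries named above (that swap is my assumption, made only to keep the example dependency-free). The class name `NiceDivFinder` and the sample markup are hypothetical. It tracks nesting depth to find where the nice div opens and where its matching closing tag sits.

```python
from html.parser import HTMLParser

# Hypothetical sample: the "nice" div from the question, wrapped in an outer div.
HTML = (
    '<div id="outer">\n'
    '<div class="nice">\n'
    '    <a href="http://www.google.com">Hello</a>\n'
    '    <div>World</div>\n'
    '</div>\n'
    '</div>\n'
)


class NiceDivFinder(HTMLParser):
    """Record where <div class="nice"> opens and where its matching </div> is."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # open <div> tags, counted from the target div onwards
        self.start = None   # (line, column) of the target's opening tag
        self.end = None     # (line, column) of the matching closing tag

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.start is None and dict(attrs).get("class") == "nice":
            self.start = self.getpos()
            self.depth = 1
        elif self.depth > 0:
            self.depth += 1  # a nested div inside the target

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.end = self.getpos()


finder = NiceDivFinder()
finder.feed(HTML)
print(finder.start, finder.end)
```

The parser already knows about comments, attributes, and quoting, so the only logic left to write is the depth counter. Only the first nice div is tracked in this sketch.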

Brian Campbell
+1 for supplying the canonical answer to the question that is asked every other day by someone who hasn't found the `search` box yet.
Greg D
+1  A: 

.NET's Regex implementation is one of the few that can handle this scenario. It offers a balanced matching feature where groups can be used and counted to parse nested patterns.

However, this still isn't a perfect solution. For example, if you throw an ill-placed HTML comment into the mix, then even a clever regex with balanced matching can fail. So it's still better to use an HTML parser.
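
The counting idea behind balanced matching can be sketched outside .NET as well. The following is not .NET's balancing-group syntax; it is a Python emulation (an assumption on my part) that uses a plain regex only to find div tags and keeps the counter in ordinary code. The function name `match_nice_div` and the sample strings are hypothetical.

```python
import re

DIV_TAG = re.compile(r'<(/?)div\b[^>]*>')      # any opening or closing div tag
NICE_OPEN = re.compile(r'<div\sclass="nice">')  # the tag we want to start from


def match_nice_div(html):
    """Return the text from <div class="nice"> to its matching </div>, or None."""
    opening = NICE_OPEN.search(html)
    if opening is None:
        return None
    depth = 0
    # Scan div tags starting at the target; +1 for each open, -1 for each close.
    for tag in DIV_TAG.finditer(html, opening.start()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            return html[opening.start():tag.end()]
    return None  # ran out of input before the counter returned to zero


sample = ('<div id="outer"><div class="nice"><a href="#">Hello</a>'
          '<div>World</div></div></div>')
print(match_nice_div(sample))

# An ill-placed comment fools the counter, as described above.
tricky = '<div class="nice"><!-- </div> --><div>World</div></div>'
print(match_nice_div(tricky))
```

The second call shows the weakness: the closing tag inside the comment is counted like any other, so the match ends too early.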

Steve Wortham
A: 

Balanced matching seems to be exactly the right tool for this. It can presumably be implemented in many languages, but Perl and .NET make the best attempts as far as I can see. As Perl has the simplest example, here's one (borrowed from http://www.perl.com/pub/a/2003/06/06/regexps.html):

$paren = qr/
      \(
        ( 
           [^()]+  # Not parens
         | 
           (??{ $paren })  # Another balanced group (not interpolated yet)
        )*
      \)
    /x;

The (??{ $paren }) construct simply refers to the regex itself, resulting in a recursive regex. Beautiful. I guess I should've mentioned that I was open to solutions like this, but of course this is not a purely regular expression in the formal sense, which is impossible by definition :)
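
For readers who don't use Perl, the same recursion can be written as an ordinary function. This is a hand-written Python sketch of the grammar the pattern above describes, not a translation of Perl's regex engine; the name `match_paren` is hypothetical.

```python
def match_paren(text, pos=0):
    """Return the index just past a balanced (...) group starting at text[pos].

    Returns None if text[pos] is not "(" or the group is never closed.
    """
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1  # consume the opening paren
    while pos < len(text):
        if text[pos] == ")":
            return pos + 1            # closing paren of this group
        if text[pos] == "(":
            pos = match_paren(text, pos)  # another balanced group: recurse
            if pos is None:
                return None
        else:
            pos += 1                  # any non-paren character
    return None  # input ended before the group was closed


print(match_paren("(a(b)c)"))
print(match_paren("(a(b)c"))
```

The recursive call plays the role of (??{ $paren }): each nested opening paren hands control to a fresh copy of the same matcher.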

oleks
A: 

As others have said, it's generally a bad idea. But you said you were just asking out of curiosity, so here goes...

Your problem is impossible to solve with the traditional concept of regex, but some engines, like .NET's, cheat a little and give you a way to do it with a "balancing group definition".

Here's a tutorial: http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx

lonekorean