ansaurus

Question

Matching nesting tags

Answer 1

+2 A:

The usual advice is not to use regexps for HTML, since HTML isn't a regular language. Any attempt to parse it with regular expressions (especially for something as rigorous as the above) is going to be fraught with difficulty.

Brian Agnew 2010-01-05 21:49:31

The fact that HTML isn't regular doesn't necessarily mean it can't be parsed by modern regex implementations, which go beyond what is needed to recognize regular languages (backreferences, for example).
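
To illustrate the point about backreferences (this sketch is an illustration, not part of the original comment): the language "n copies of a, one b, then the same n copies of a" is not regular, yet a pattern with a backreference recognizes it. The names SAME_RUN and is_mirrored are made up for this example, and Python's built-in re module is assumed.

```python
import re

# (a+) captures a run of 'a'; \1 demands the exact same run again after the 'b'.
# A true regular expression cannot "remember" how long the first run was.
SAME_RUN = re.compile(r"(a+)b\1")

def is_mirrored(text: str) -> bool:
    """Return True if text is a^n b a^n for some n >= 1."""
    return SAME_RUN.fullmatch(text) is not None
```

This only shows that such engines exceed regular languages; it says nothing about whether they are a good fit for arbitrarily nested HTML.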

oleks 2010-01-05 22:09:18

Answer 2

+3 A:

It's impossible in a regex.

Use an HTML parser instead, such as Beautiful Soup, html5lib, hpricot, or nokogiri.
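
As a rough sketch of the parser approach (an illustration, not the answerer's code), the following uses html.parser from Python's standard library rather than any of the libraries named above. The class NestingChecker, the function check_nesting, and the short VOID list are assumptions made for this example; a real document would need a fuller list of void elements.

```python
from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    """Tracks open tags on a stack to check that tags nest properly."""

    # Elements that never take a closing tag (deliberately incomplete list).
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, innermost last
        self.max_depth = 0     # deepest nesting level seen
        self.balanced = True   # set to False on the first mismatched close

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False

def check_nesting(html: str):
    """Return (properly_nested, max_depth) for the given markup."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.balanced and not checker.stack, checker.max_depth
```

Because the parser tokenizes the input first, a closing tag that appears inside an HTML comment is not counted, which is the kind of case that tends to trip up a hand-written pattern.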

Brian Campbell 2010-01-05 21:49:37

+1 for supplying the canonical answer to the question that is asked every other day by someone who hasn't found the `search` box yet.

Greg D 2010-01-05 21:52:48

Answer 3

+1 A:

.NET's Regex implementation is one of the few that can handle this scenario. It offers a balanced matching feature where groups can be used and counted to parse nested patterns.

However, this still isn't a perfect solution. For example, if you throw an ill-placed HTML comment into the mix, even a clever regex with balanced matching can fail. So it's still better to use an HTML parser.

Steve Wortham 2010-01-05 21:55:49

Answer 4

A:

Balanced matching seems to be exactly the right tool for this. It can presumably be implemented in many languages, but Perl and .NET make the best attempts as far as I can see. Perl has the simplest example, so here's one (borrowed from http://www.perl.com/pub/a/2003/06/06/regexps.html):

    $paren = qr/
        \(
          (
             [^()]+          # Not parens
           |
             (??{ $paren })  # Another balanced group (not interpolated yet)
          )*
        \)
    /x;

The (??{ $paren }) construct simply refers to the regex itself, resulting in a recursive regex. Beautiful. I guess I should have mentioned that I was open to solutions like this, although of course this is not a purely regular expression, which is impossible by definition :)
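
For comparison, here is a sketch of what the recursive pattern above accepts, written without any regex (an illustration, not part of the original answer). Python's built-in re module has no recursion construct, so this example swaps in a plain depth counter; the function name is_balanced_group is made up for it.

```python
def is_balanced_group(text: str) -> bool:
    """Return True if text is exactly one balanced parenthesized group."""
    if not text.startswith("("):
        return False
    depth = 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # The outermost group just closed; it must be the last character.
                return i == len(text) - 1
    # Ran out of input with at least one group still open.
    return False
```

The counter plays the role that the recursion plays in the Perl pattern: each nested "(" is one more level that has to be closed before the outer group can end.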

oleks 2010-01-05 22:37:38

Answer 5

A:

As others have said, it's generally a bad idea. But you said you were just asking out of curiosity, so here goes...

Your problem is impossible to solve with the traditional concept of regex, but some engines, like .NET's, cheat a little and give you a way to do it with a "balancing group definition".

Here's a tutorial: http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx

lonekorean 2010-01-05 22:40:35
