views:

5550

answers:

10

Is it possible to write a regular expression that matches a nested pattern that occurs an unknown number of times. For example, can a regular expression match an opening and closing brace when there are an unknown number of open closing braces nested within the outer braces.

For example:

public MyMethod()
{
  if (test)
  {
    // More { }
  }

  // More { }
} // End

Should match:

{
  if (test)
  {
    // More { }
  }

  // More { }
}
+4  A: 

No, you are getting into the realm of Context Free Grammers at that point

Craig H
+2  A: 

No. You need a full-blown parser for this type of problem.

Adam Rosenfield
... or Perl5.10 or higher
Brad Gilbert
+33  A: 

No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).

There are many parser generators avialable, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia

Torsten Marek
I expected as much. Thanks for the detailed answer.
Richard Dorman
Torsten is correct as far as theory is concerned.In practice many implementations have some trick in order to allow you to perform recursive "regular expressions". E.g. see the chapter "Recursive patterns" in http://php.net/manual/en/regexp.reference.php
daremon
I am spoiled by my upbringing in Natural Language Processing and the automata theory it included.
Torsten Marek
A refreshingly clear answer. Best "why not" I've ever seen.
Ben Doom
I agree with Ben completely: excellent answer.
Adam Bernier
Regular expressions in language theory and regular expressions in practice are different beasts... since _regular_ expressions can't have niceties such as back references, forward references etc.
Novikov
+6  A: 

Probably working perl solution, if the string is on one line:

my $NesteD ;
$NesteD = qr/ \{( [^{}] | (??{ $NesteD }) )* \} /x ;

if ( $Stringy =~ m/\b( \w+$NesteD )/x ) {
    print "Found: $1\n" ;
  }

HTH

EDIT: check:

And one more thing by Torsten Marek (who had pointed out correctly, that it's not a regex anymore):

Zsolt Botykai
Thanks, removed my earlier comment. Sadly, I'm out of votes for today:(
Torsten Marek
Yup. Perl's "regular expressions" aren't (and haven't been for a very long time). It should be noted that recursive regexes are a new feature in Perl 5.10 and that even though you can do this you probably shouldn't in most of the cases that commonly come up (e.g. parsing HTML).
Michael Carman
http://perldoc.perl.org/perlretut.html
Brad Gilbert
+2  A: 

The Pumping lemma for regular languages is the reason why you can't do that.

The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the same state can be duplicated infinitely many times and the automaton will not know the difference.

In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should) it will also accept the pumped number of opening braces followed by unchanged k+1 closing brases (which it shouldn't).

Rafał Dowgird
+1  A: 

Proper Regular expressions would not be able to do it as you would leave the realm of Regular Languages to land in the Context Free Languages territories.

Nevertheless the "regular expression" packages that many languages offer are strictly more powerful.

For example, Lua regular expressions have the "%b()" recognizer that will match balanced parenthesis. In your case you would use "%b{}"

Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with {#}.

So, depending on the tools you have at your disposal your "regular expression" (in a broader sense) may be able to match nested parenthesis.

Remo.D
+1  A: 

as zsolt mentioned, some regex engines support recursion -- of course, these are typically the ones that use a backtracking algorithm so it won't be particularly efficient. example: /(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm

+2  A: 

Yes, if it is .NET RegEx-engine. .Net engine supports finite state machine supplied with an external stack. see details http://retkomma.wordpress.com/2007/10/30/nested-regular-expressions-explained/

As others have mentioned, .NET is _not_ the only capable regex engine to do this.
Ben S
A: 

This seems to work: /(\{(?:\{.*\}|[^\{])*\})/m

Sean Huber
A: 

Using regular expressions to check for nested patterns is very easy.

'/(\((?>[^()]+|(?1))*\))/'
MichaelRushton