tags:

views:

822

answers:

8

In a PHP script, what regex should I use to check for mismatched parentheses in a string? Things that I want to allow include:

  • This is (ok)
  • This (is) (ok)

Things I want to prevent:

  • This is )bad(
  • This is also (bad
  • This is (bad (too)

Thanks!

Update: You guys all rock. Doing this with a regex seemed trickier than it should have, and these kinds of 2nd level answers are what makes stackoverflow beautiful. Thanks for the links and the pseudocode. I'm not sure who to give the answer to, so I apologize to everyone whose answers I can't accept.

+7  A: 

It is not possible to accomplish this with a regex. Brace matching requires a recursive/counting feature that is not available in a regex. You'll need a parser for this.

More details available here: http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

JaredPar
Great link. Thanks Jared.
James Socol
Recursion and counting ARE available features in some languages' regular expression engines, e.g. Perl's and PHP's. See Bennett McElwee's answer. Perl has a Regexp::Common::balanced module in CPAN that does this.
Brian Carper
@Brian, this is more of a semantic argument but at the point a regex can do recursion, it's not actually a regex anymore. It's a context free grammar. I realize this really means nothing to most people though and recursion is quickly becoming a must have feature for regexes.
JaredPar
You're right, but I think in context of the OP's question, it's important to point out that whatever PHP calls a "regex" can actually do what he's asking it to do. (Let's hope they are still called "regex" for a while, cfgex doesn't roll off the tongue as nicely. :)
Brian Carper
@Brian, I actually didn't realize that PHP regex's have recursive capabilities (usually work with the .Net version). cregex is awful. I like modregex but that would confuse perl guys. Maybe rrregex (recursive ready regex). You'd have to use a tony the tiger frosted flakes voice to say it though.
JaredPar
+2  A: 

To extend JaredPar's answer, it's not very difficult to solve without using a regex, just write a function that examines each character in the string and increments/decrements a counter. If you find a "(", increment it, and if you find a ")", decrement it. If the counter ever goes below 0, you can break, the string is invalid. When you've processed the whole string, if the counter isn't 0, there was an unmatched open parenthesis.

Chad Birch
A: 

What Jared said... Can't do it. I don't even think you can do it with the #1 #2 stuff that Perl gives you...

Brian Postow
+12  A: 

Regex is not the right tool for the job. Scan a string manually.

Pseudo-code:

depth = 0
for character in some_string:
    depth += character == '('
    depth -= character == ')'
    if depth < 0:
       break

if depth != 0:
   print "unmatched parentheses"
J.F. Sebastian
+1  A: 

Your examples don't include any nested parentheses… if you aren't concerned with nesting, then this can be done using the following expression:

^[^()]*(?:\([^()]*\)[^()]*)*$

This will match against all the strings in your "allow" list and fail against the strings in your "prevent" list. However, it will also fail against any string with nested parentheses. e.g. "this (is (not) ok)"

As others have already pointed out, regular expressions are not the correct tool if you need to handle nesting.

Ben Blank
+2  A: 

Agree with the fact that this is impossible with a REGEX. You could do the following, though:

<?php

$testStrings = array( 'This is (ok)', 'This (is) (ok)', 'This is )bad(', 'This is also (bad', 'This is (bad (too)' );

foreach( $testStrings as $string ) {
    $passed = hasMatchedParentheses( $string ) ? 'passed' : 'did not pass';
    echo "The string $string $passed the check for matching parenthesis.\n";
}

function hasMatchedParentheses( $string ) {
    $counter = 0;
    $length = strlen( $string );
    for( $i = 0; $i < $length; $i ++ ) {
     $char = $string[ $i ];
     if( $char == '(' ) {
      $counter ++;
     } elseif( $char == ')' ) {
      $counter --;
     }
     if( $counter < 0 ) {
      return false;
     }
    }
    return $counter == 0;
}

?>
nickohrn
+1  A: 

Why it's not possible with a regex

The other answers are all correct, but I just want to put in a plug for theoretical computer science... this is a case where knowing the theory gives an actual practical advantage.

A regex corresponds to a deterministic finite automaton (DFA), but paren matching require a context-free grammar, which can be realized as a finite automaton (PDA) but not by a DFA.

Because of this, without a lot of extra brain-work, we know that the answer is no, and we don't have to worry that there is something we're just overlooking. So, you can be confident in the above answers, and not worry that the authors are just overlooking something when they give that answer.

Almost all compiler books will talk about this, here's a quick overview:

http://books.google.com/books?id=4LMtA2wOsPcC&amp;pg=PA94&amp;lpg=PA94&amp;dq=push-down+finite+automata&amp;source=bl&amp;ots=NisYwNO1r0&amp;sig=ajaSHFXwpPOWG8IfbcfKoqzS5Wk&amp;hl=en&amp;ei=m26cSdf6DZGYsAPB-6SsAg&amp;sa=X&amp;oi=book_result&amp;resnum=6&amp;ct=result

Mark Harrison
This is not quite true. Regexes, as used in PHP and many other contexts, are not actually regular expressions in the formal sense. They add a whole lot of features that make them more useful and extend them beyond what real regular expressions can do. This is why the problem is solvable.
Bennett McElwee
Hmm, what feature? Do you have an example that will work on "((()))", "()()()", "(((", or "()))"?
Mark Harrison
The recursive subpattern feature. See my answer to this question, which correctly handles the four examples you give.
Bennett McElwee
Bennett, sweet!!
Mark Harrison
+8  A: 

You can do this with a regular expression -- PCRE, as used by PHP, allows recursive patterns. The PHP Manual gives an example that is almost exactly what you want:

\(((?>[^()]+)|(?R))*\)

This matches any correctly parenthesised substring as long as it begins and ends with parentheses. If you want to ensure the entire string is balanced, allowing strings like "wiggedy(wiggedy)(wiggedy(wack))", here's what I came up with:

^((?:[^()]|\((?1)\))*+)$

Here's an explanation of the pattern, which may be more illuminating than obfuscatory:

^             Beginning of the string
(             Start the "balanced substring" group (to be called recursively)
  (?:         Start the "minimal balanced substring" group
    [^()]     Minimal balanced substring is either a non-paren character
    |         or
    \((?1)\)  a set of parens containing a balanced substring
  )           Finish the "minimal balanced substring" group
  *           Our balanced substring is a maximal sequence of minimal
              balanced substrings
  +           Don't backtrack once we've matched a maximal sequence
)             Finish the "balanced substring" pattern
$             End of the string

There are lots of considerations of efficiency and correctness that come up with these sorts of regexes. Be careful.

Bennett McElwee
+1. Perl can do it too via the (??{ code }) extension. Many (most?) modern languages' "regular expression" engines are strictly more powerful than regular languages / DFAs.
Brian Carper
Perl has the same (?R) subgroup. But this feature, as well as (?? { code }), are highly experimental, and the documentation states that their exact side effects are subject to change between versions. So for production code, don't try to use hacks in the finite automaton (PCRE).
Chris Lutz
Chris, do you have a reference for the "highly experimental" warning? The PCRE man page at http://www.pcre.org/pcre.txt doesn't say this.
Bennett McElwee
This is potentially a *very* dangerous feature. Use of the recursion flag can both cause catastrophic (PCRE-crashing) backtracking, as well as excessive (script-terminating) memory consumption if you use capturing groups. For a simple case like this, it is *far* better to not use regexes at all!
Ben Blank
That's is true of regexes in general. Catastrophic backtracking can easily occur without using recursion. The best thing about regexes is their incredible power and flexibility. The worst thing about regexes is their incredible power and flexibility.
Bennett McElwee
+1 - great answer.
nickohrn
+1 - Though now you definitely have two problems.
Wickethewok
Expresso really barfs on both of those, though I respect BME's wisdom
Brent.Longborough
My example uses a recursive pattern and a possessive quantifier. I don't think these are supported by the .Net framework, so no wonder Expresso is confused. (Though could replace the possessive '+' with a .Net nonbacktracking group.)
Bennett McElwee
.Net has something called a balancing group, which apparently can be used to solve the balancing parentheses problem. It might even be more efficient that PCRE recursive groups.
Bennett McElwee