tags:

views:

480

answers:

7

I'm trying to write a regular expression to identify an if statement. The only problem I'm having is getting it capture if statements that have parentheses in their parenthesis. For example:

if (condition_function(params)) {
     statements;
}

My expression to capture all if statements except these is:

 if\s*\(([^\(\)]|\s)*\)\s*{(.|\s)*?}

Does anyone know how to write that?

+10  A: 

That is not possible with regular expressions since regular expressions can only match regular languages and the one you are trying to parse is context-free and not regular (thanks to dirkgently and dmckee).

Have a look at WP: Formal language theory is you are interested...

Btw. You can't even check an expression only made of parentheses if it's correct ( [[][]] is correct but []][ is not) which is a "subproblem" of the one you gave above.

Johannes Weiß
Every regular language is context-free.
dirkgently
+1, but fix up the nit that dirkgently picked...
dmckee
dirkgently, true but not vice-versa.
Johannes Weiß
But you haven't told the OP that the language he's trying to match is irregular...saying that it is context free doesn't help on that account.
dmckee
I don't really get it. What do you mean by irregual. Because his regex allows strange things like if(}{){}
Johannes Weiß
I mean the language he want to match language is context free *and* not regular. Saying "the one you are trying to parse is context-free" does not preclude it from being regular. That is dirkgently's point. Regular is a proper subset of context free.
dmckee
Ahh, thanks, now I got it. Sorry!
Johannes Weiß
A: 

If you have to use a regexp even though it will never catch all cases, this one is better:

if\s*\(((?!\s*\{).+)\)\s*\{(.|\s)*?\}

It uses a positive lookahead ((?!\s*\{).) which ensure to capture all until the closing ) (except if your condition statement has a " {" in it! That is where the regexp can not help you)

VonC
+3  A: 

You're trying to write a regular expression to parse a non-regular language? That'll never fly.

Ken
If not regular expressions, then what would you suggest to parse this non-regular language?
Koukaakiva
Your compiler uses a lexer followed by syntactic analysis. If you only need to identify if-statements, and can't use an existing parser, you can probably write a recursive descent parser without too much trouble.
Ken
A: 

a quick shot at it...

if\s*?(.?)\s?(({?\s*?(.?;)+\s?})|(.*?;))

A: 
r = /\bif\s*\(/

txt = <<TXT
if(test)
if (test)
if  (xyz)
; if
print x if(true)
TXT

p txt.scan(r)

if(something).. something can be anything.. if there is a string with a parenthesis end inside it and you want to deal correctly with matching parenthesis pairs then you will quickly end up with a big regex.

Also what language are you trying to match against?

neoneye
I am trying to build a parser for the mediaWiki engine that is better suited to the project I'm working on. One of the thing we are continually encountering is the need for if statements. The mediaWiki extension for if statements does not match what we need them for either.
Koukaakiva
so if I understand you correct (I don't know mediawiki).. you needs optional-parts in the mark up, that can be enabled/disabled ?
neoneye
if you just needed to extract if statements then you could use regex. However for the task I think you need to write your own parser. Regex can't solve it alone. It's not an easy task, good luck.
neoneye
+1  A: 

I think this may work. If anyone sees something I don't, like a reason it won't work, please respond.

if\s*\(((?:[^\(\)]|\((?1)\))*+)\)\s*{((?:[^{}]|{(?2)})*+)}

The only problem this should encounter now is if there is an if statement in an if statement.

I've tested this on every valid if statement that I can think of that might break it and the only thing that it does not work on is one that contains a string with an unmatched parenthesis.

Update: I found an error with the above regular expression. It does not catch if statements that contains strings with unmatched parenthesis in their condition or statement sections. Like the following example:

if (var1 == "("){
    echo "{";
}

However this is a valid if statement. The solution:

if\s*\(((?:(?:(?:"(?:(?:\\")|[^"])*")|(?:'(?:(?:\\')|[^'])*'))|[^\(\)]|\((?1)\))*+)\)\s*{((?:(?:(?:"(?:(?:\\")|[^"])*")|(?:'(?:(?:\\')|[^'])*'))|[^{}]|{(?2)})*+)}\s*

This regular expression captures all if statements even ones that contain strings with unmatched parenthesis.

UPDATE: I now have it so that is captures the else and else if statements that are attached to if statements. The only problem is that the capture groups it returns are the last else and the last else if in the if statement. Hopefully I'll figure out how to get around that as well.

if\s*\(((?:(?:(?:"(?:(?:\\")|[^"])*")|(?:'(?:(?:\\')|[^'])*'))|[^\(\)]|\((?1)\))*+)\)\s*{((?:(?:(?:"(?:(?:\\")|[^"])*")|(?:'(?:(?:\\')|[^'])*'))|[^{}]|{(?2)})*+)}\s*(?:(?:else\s*{((?:(?:(?:"(?:(?:\\")|[^"])*")|(?:'(?:(?:\\')|[^'])*'))|[^{}]|{(?3)})*+)}\s*)|(?:else\s*if\s*\(((?:(?:(?:"(?:(?:\\")|[^"])*")|(?:'(?:(?:\\')|[^'])*'))|[^\(\)]|\((?4)\))*+)\)\s*{((?:(?:(?:"(?:(?:\\")|[^"])*")|(?:'(?:(?:\\')|[^'])*'))|[^{}]|{(?5)})*+)}\s*))*;

If you want to test it out, here's a great website for it: http://gskinner.com/RegExr/

Koukaakiva
You still won't catch everything, because you're not using a real parser. As but one trivial example, this will match a string literal like x="if (x) { y; }";
Ken
That's exactly what I'm trying to do. This is being used for an alternative parse engine for mediaWiki. Since this finds if statements, I'll be able to take them out and run them as if statements in my code.
Koukaakiva
Incredible. I'll give you +1 but I hope I never see this in any code that I have to touch. For extra points, translate it to brainf**k
erikkallen
Thank you very much. Now I'm working to get it to also capture if statements with else and else if statements. Wish me luck.
Koukaakiva
+1  A: 

You need to write code in a Turing complete language. There are tools that can automatically construct the code for you, such as Flex. However, if you just have a simple problem, it is probably easiest to just write some simple parsing code yourself. Here is some example C# code that might help you get started.

public void TestIf()
    {
      var s = @"if (condition_function(params)) {
     statements;
       }";
      var ifRegex = @"if *\(.*\) *{.*}";
      if (Regex.IsMatch(s, ifRegex, RegexOptions.Singleline))
      {
        var firstParens = s.IndexOf('(');
        if (firstParens != -1)
        {
          var conditionPart = s.Skip(firstParens + 1);
          int stack = 1;
          int lastParens = -1; 
          while(stack > 0)
          {
            for (int i = 0; i < conditionPart.Count(); i++)
            {
              var c = conditionPart.ElementAt(i);
              if (c == '(')
              {
                stack++;
              }
              if (c == ')')
              {
                stack--;
              }
              if (stack == 0)
              {
                lastParens = i;
                break; 
              }
            }
          }
          if (lastParens != -1)
          {
            var condition = conditionPart.Take(lastParens);
            Console.WriteLine(new string(condition.ToArray()));
          }
        }
      }
    }
RossFabricant
This is very helpful. Looking at other portions of my code, I realize I may have to switch to something like this. And this code will probably save me a lot of time. Thanks.
Koukaakiva
And you may also have to account for strings and comments which may contain brackets.
pro3carp3
You probably want to strip out comments in an earlier step. Don't forget to deal with nested comments!
RossFabricant