views:

277

answers:

4

Hi,

I am working on a math expression parser using regular expressions and I am trying to add support for parentheses.

My parser works like this:

function parse_expression(expression){
    Find parenthetical expressions
    Loop through parenthetical expressions, call parse_expression() on all of them
    Replace parenthetical expression with value of expression
    Find value of expression
    Return value
}

Because it it recursive, I need to find only the outmost parenthetical expressions. For example if I was parsing the string "(5 + (4 + (3 / 4) + (3 * 2) + 2)) + (1 + 2)", I want to find the expressions "5 + (4 + (3 / 4) + (3 * 2) + 2)" and "1 + 2". How do you do this with Regular Expressions?

The regular expression I have now ( "\(([^\)]+)\)" ) would return just "5 + ( 4 + ( 3 * 2", it doesn't get the full first expression and it gets none of the second.

Any ideas?

Thanks,

Kyle

+2  A: 

If I'm not mistaken, this language is not regular, so it is a theoretical impossibility to do this with regular expressions.

Thomas
+6  A: 

Unfortunately, the language of arbitrarily nesting parenthesis is not regular and can therefore not be matched using a regular expression.

Specifically, a regular language is one that can be parsed using a finite automata, which has a (set) finite number of states. To match an arbitrarily-nested set of parentheses requires an arbitrary number of states, to count the parentheses as they go past.

Most "regular expression" libraries (especially perl's) don't strictly match a regular language, but they still have this restriction.

The most straightforward way to solve your problem is a recursive descent parser. An inefficient method is to just look through the string, counting parentheses as you go, to find which sub-strings to descend into.

You will also find your parser to be simpler if you insist that operations are parenthesised, for example only allowing (1+2)+3 or 1+(2+3) rather than 1+2+3.

Andrew Aylett
+2  A: 

You should be using a parser. Have it parser traverse the string, and increment the parentheses count each time it encounters a (, and decrement the count each time it hits a ). when it next hits zero count, you have the range of your outermost parenthetical expression.

Kenny Winker
I was considering this, but I think David Hedlund's solution suits my problem better.
Kyle
Good call. I like his solution better myself :D
Kenny Winker
+4  A: 

Since you're iterating through it all, I'd say you should still do that, but go the other way around. Find the smallest subsets of paranthetical expressions, rather than the largest ones:

(\([^(]+\))

Evaluate them, and replace them with their values, i.e., first time round, the matches will be (3 / 4), (3 * 2) and (1 + 2). Replace these with 0,75, 6 and 3, respectively, giving a new string:

(5 + (4 + 0,75 + 6 + 2)) + 3

And then you iterate that, until there are no more parenthetical expressions, working bottom-up rather than top-down (just like you would manually solve a task like this!)

Other than that, I agree with all others that exactly what you were asking for should not (indeed could not) be done with regular expressions. But your problem could be solved with this solution that involves regular expressions.

David Hedlund
This is a clever solution. It would require that you do a lot of string manipulation, and it will require that you re-parse numbers a lot, but it should work otherwise.
Daniel Yankowsky
I Like it. That should work great. Thanks.
Kyle
I tried this and there should be a slight correction to the regex you provided. It should be '(\([^\(]+?\))' (added a backslash to escape the parenthesis and a question mark to make it match as few as possible).
Kyle
The backslash is not needed within the group-brackets,`[`, `]`. the left parenthesis is not a special character within that group. Also, you do not want as few characters as possible, you really want as many as possible, as long as we've specified it must not be a left parenthesis. the non-greedy operator `?` is just a performance thief in that scenario. i've tested my original expression and it works =)
David Hedlund
If you match as many as possible you get all of the ending parentheses, not just the one that closes the innermost parenthetical expression. For example, the string "(5+(4+(3*2)+2))" has the match "(3*2)+2))" (or at least it does for me), which is not what I want.
Kyle
hey, it does in my test before, i didn't notice. for performance, i'd consider using this `(\([^()]+\))` which would also solve that problem. but hey, whichever works for you is fine :) good luck with the rest of the problem, i take it a working regex won't be enough to bring this one home
David Hedlund
Thanks, that regex works perfectly. This exemplifies exactly what makes asking for programming help so difficult: it never seems to be the same for the asker and the answer.
Kyle
hey, this kept me awake. if you think i'm spoiling the fun of the task here, don't read the code, but i got this working: http://pastebay.com/78093 a javascript that'll solve a lot of these issues. it handles multiple operations in one parenthetical expression (i.e. `(1+5/2)`). pretty much the only thing it doesn't handle (well, within the realm of +-*/ operations), is that if you'd write `2(1+2)` it wouldn't work (proper way would be to *infer* `2*(1+2)`)
David Hedlund
Ok, I actually finished my algorithm shortly after reading this post. It is written in Objective-C, but I decided to port it over to JavaScript for comparison (http://pastebay.com/78117). Mine had to handle fractions, decimals and integers, so it is a bit more involved. Also, mine was not originally written in JavaScript, and I didn't take the time to clean it up much, so it looks a bit odd. I take more of a parser approach, using a value and operator stack to hold my operations. It handles `2(1+2)` through a string replace before parsing.Probably could be optimized a bit but it seems to work.
Kyle
that's awesome. always fun to compare with how somebody else would approach the same problem =) good job
David Hedlund