tags:

views:

576

answers:

9

For example, these are valid math expressions:

a * b + c
-a * (b / 1.50)
(apple + (-0.5)) * (boy - 1)

And these are invalid math expressions:

--a *+ b @ 1.5.0  // two consecutive signs, two consecutive operators, invalid operator, invalid number
-a * b + 1)  // unmatched parentheses
a) * (b + c) / (d  // unmatched parentheses

I have no problem with matching float numbers, but have difficulty with parentheses matching. Any idea? If there is better solution than regular expression, I'll accept as well. But regex is preferred.

========

Edit:

I want to make some comments on my choice of the “accepted answer”, hoping that people who have the same question and find this thread will not be misled.

There are several answers I consider “accepted”, but I have no idea which one is the best. So I chose the accepted answer (almost) randomly. I recommend reading stereofrog’s answer and Guillaume Malartre’s answer as well besides the accepted answer. All of them give practical solutions to my question. For a somewhat rigorous/theoretical answer, please read David Thornley’s comments under the accepted answer. As he mentioned, Perl’s extension to regular expression (originated from regular language) make it “irregular”. (I mentioned no language in my question, so most answerers assumed the Perl implementation of regular expression – probably the most popular implementation. So did I when I posted my question.)

Please correct me if I said something wrong above.

+2  A: 

You can't use regex to do things like balance parenthesis.

Eric H.
That is only correct if they can be nested arbitrarily deep.
klausbyskov
Yes you can. Please see the post below...
drewk
No, you can't. If there are expressions with which you can ensure the balance of parentheses, they're not *regular* expressions.
jemfinch
@drewk: "the post below" has no meaning here, since the order of answers is variable. Please identify it as "X's answer" or something like that.
David Thornley
I think jemfinch was mentioning Victor Hurdugaci's post. I am reading the link given by Victor. I got confused getting two opposite kinds of answers.
Ethan
@Ethan: Victor's post is not about using regular expressions, but rather a pushdown automaton, which can handle balanced expressions. You can't check balanced parentheses with regular expressions, but you can with some enhancements. Victor gave one sort of enhancement (which is theoretically simpler), and drewk gave another (using some of the Perl extensions to regular expressions).
David Thornley
There are regular expression implementations that support recursive patterns (like PCRE or .NET’s). But those are not regular expressions in terms of formal language theory.
Gumbo
Who cares about "theoretical" regexps? Functions in programming are quite different from mathematical "functions", is this a reason to avoid them? This answer is wrong, -1.
stereofrog
+1  A: 

For parenthesis matching, and implementing other expression validation rules, it is probably easiest to write your own little parser. Regular expressions are no good in this kind of situation.

Eric Eijkelenboom
I would actually try an eval in Perl or Python - seems like the fastest way to test. The expression is meant to be evaluated anyway, so why not:`try { worked = true; evaluate() } catch (...) { worked = false; }`
Hamish Grubijan
+5  A: 

I believe you will be better off implementing a real parser to accomplish what you're after.

A parser for simple mathematical expressions is "Parsing 101", and there are several examples to be found online.

Some examples include:

Note that the grammar you will need for validating expressions is simpler than the examples above, since the examples also implement evaluation of the expression.

codeape
A LALR(1) grammar is way overkill for this. The question describes a context-free language, and one that can be recognized by a relatively small addition to regular experssions.
David Thornley
That depends, IMO. A parser grammar might be easier to maintain and extend than a complex regex.
codeape
+4  A: 

Regular expressions can only used to recognize regular languages. The language of mathematical expressions is not regular; you'll need to implement an actual parser (e.g. LR) in order to do this.

Rob Lachlan
"regular expressions are for regular languages" is a tautology and makes no sense.
stereofrog
@stereofrog: It isn't quite a tautology. You may feel it's too simple, but all kinds of programmers aren't even aware of formal languages, don't know what a context-free language is, what a regular language is, and why regular expressions might not be the right choice for a particular problem.
Rob Lachlan
+6  A: 

Use a pushdown automaton for matching paranthesis http://en.wikipedia.org/wiki/Pushdown_automaton (or just a stack ;-) )

Details for the stack solution:

while (chr available)
    if chr == '(' then
      push '('
    else
      if chr == ')' then
        if stack.elements == 0 then
          print('too many or misplaced )')
          exit
        else
          pop //from stack
end while
if (stack.elements != 0)
  print('too many or misplaced(')

Even simple: just keep a counter instead of stack.

Victor Hurdugaci
+2  A: 

Matching parens with a regex is quite possible.

Here is a Perl script that will parse arbitrary deep matching parens. While it will throw out the non-matching parens outside, I did not design it specifically to validate parens. It will parse arbitrarily deep parens so long as they are balanced. This will get you started however.

The key is recursion both in the regex and the use of it. Play with it, and I am sure that you can get this to also flag non matching prens. I think if you capture what this regex throws away and count parens (ie test for odd parens in the non-match text), you have invalid, unbalanced parens.

#!/usr/bin/perl
$re = qr  /
     (                      # start capture buffer 1
        \(                  #   match an opening paren
        (                   # capture buffer 2
        (?:                 #   match one of:
            (?>             #     don't backtrack over the inside of this group
                [^()]+    #       one or more 
            )               #     end non backtracking group
        |                   #     ... or ...
            (?1)            #     recurse to opening 1 and try it again
        )*                  #   0 or more times.
        )                   # end of buffer 2
        \)                  #   match a closing paren
     )                      # end capture buffer one
    /x;


sub strip {
    my ($str) = @_;
    while ($str=~/$re/g) {
        $match=$1; $striped=$2;
        print "$match\n";
        strip($striped) if $striped=~/\(/;
        return $striped;
    }
}

while(<DATA>) {
    print "start pattern: $_";
    while (/$re/g) { 
        strip($1) ;
    }
}   

__DATA__
"(apple + (-0.5)) * (boy - 1)"
"((((one)two)three)four)x(one(two(three(four))))"
"a) * (b + c) / (d"
"-a * (b / 1.50)"

Output:

start pattern: "(apple + (-0.5)) * (boy - 1)"
(apple + (-0.5))
(-0.5)
(boy - 1)
start pattern: "((((one)two)three)four)x(one(two(three(four))))"
((((one)two)three)four)
(((one)two)three)
((one)two)
(one)
(one(two(three(four))))
(two(three(four)))
(three(four))
(four)
start pattern: "a) * (b + c) / (d"
(b + c)
start pattern: "-a * (b / 1.50)"
(b / 1.50)
drewk
While this works, I won't +1 it because anyone using extended regex hacks to match nested parenthesis is asking for a headache. It's much better to do it another way.
Chris Lutz
Perl's extensions to the regular expression language make them no longer *regular* expressions.
Greg Hewgill
@chris: This is not a "hack" it is directly from the Perl perlre man page. The only mod I did is is change angle bracket to prens and added the outside capturing group. I wrote the recursive subroutine. Since when is recursion and using the documented features of Perl a "hack"?
drewk
Why not use Perl's `eval`?
Hamish Grubijan
@drewk: The question mentioned no language, so most answers assumed the poster meant, um, regular expressions. What you have written is, at first glance, a pushdown automaton, using a recursive function to use the call stack as the pushdown stack. The recursion, as you say, is the key, and regular expressions have no recursion. Moreover, you didn't write a pattern in Perl's extended regular expressions, but provided a recursive function to use that pattern.
David Thornley
@drewk - The extensions to regex syntax are a hack. They are hacking a tool (regular expressions) to do things it wasn't designed for, and isn't particularly good for. IMHO, if you have to comment a regex, you shouldn't be using a single regex to accomplish the task.
Chris Lutz
@Greg: Granted, this is not POSIX form. Most modern regex flavors support most of the Perl 5.10 extensions, what is the issue? What about the Perl 5.10 regex extensions make them not regular? This regex uses a named group and an atomic group. I think most modern regex flavors have exactly this or a close variant. Look at the wikipedia comparison of regex flavors. Only Java, Ruby and Python currently lack the atomic grouping used, but I am sure it will soon be added. http://www.regular-expressions.info/atomic.html http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines
drewk
@drewk - "What about the Perl 5.10 regex extensions make them not regular?" The ability to do this. "Regular" expressions only parse "regular" languages, and the language that describes balanced parenthesis is more complicated than "regular" (it is context-free). Therefore, anything that can parse them is more complicated than a mere "regular" expression, which is why the Perl 6 documentation is using the term "regex" exclusively.
Chris Lutz
@chris: what extensions to regex are hacks? I like comments in regexs; you don't? Every modern regex supports comments except for Java. While I agree that a parser is a better tool, YAML would work, there are many ways to solve the poster's question -- he asked for a regex and this one works with very standard syntax. Maybe his problem was a two minute problem and he didn't want to lear to use a parser for goodness sakes... Read about atomic grouping at www.regular-expressions.info. It is very standard stuff that I used.
drewk
@chris: Well college was a few year ago for me :-) but I believe that a series of >>matching<< parenthesis actually are consider "regular" in the sense of that any string containing balanced parenthesis can be reduced in matching order and both the reduction and remainder still qualify as balanced and regular, no? You could pigeonhole each parenthesis in a string of any length in any order and you would have 1 paren per pigeon hole. Non matching parenthesis are definitely not "regular" and parenthesis with contents may or may not be regular. Do you proof your strings as formally "regular"?
drewk
This is a pretty cool example of recursion and backreferences. For others that would like to see a more detailed explanation, see http://rick.measham.id.au/paste/explain.pl?regex=%28%5C%28%28%28%253F%253A%28%253F%3E%5B%5E%28%29%5D%252B%29%7C%28%5C1%29%29%2A%29%5C%29%29
macek
@drewk: You do not seem to realize what a "regular expression" really is. A regular expression is one that can be specified by matching sequences, matching alternatives, matching repetitions, or any combination of them. Formally, you can match a regular language with a finite automaton, and the word "finite" should show you can't match unlimited numbers of parentheses. You're adding recursion, which solves the problem. Formally, it turns the finite automaton to a pushdown automaton. However, you're now matching context-free expressions and not regular expressions.
David Thornley
@smotchkkiss: That is a really cool thing. Did you write it?
drewk
@David: I disagree I do not realize what a regular expression really is, but my teenagers do tell me daily how little I do know. :-) Is any non-infinite set of matching parentheses regular? Yes. If you remove parenthesis in matching order, each part is also regular. The operation is idempotent. I grant that non-matching parenthesis are by definition not 'regular' As a practical matter, do you only apply regular expression to regular language? My regex may not match an unlimited number of parens, but no regex implementation can deal with infinite matches. Engineering vs theory...
drewk
@drewk, I did not. Rick Measham (http://rick.measham.id.au/) put this wrapper together based off of http://search.cpan.org/~pinyan/YAPE-Regex-Explain-3.011/Explain.pm
macek
A: 

Well, people telling its impossible are wrong. I dit it for calculation validation (present in the calculator) and it sure is a lot of work. If you have more interest in the solution presented in the video I could probably take more time to share a simple example of parenthesis group analysis with regexp.

Guillaume Malartre
I'd be very interested in knowing how you did that, considering that it's impossible.
David Thornley
@Guillaume Malartre: why not put your simple example here to make the discussion more meaningful and informative, if it won't take you a lot of time?
Ethan
"Well, people telling its impossible are wrong."Well, no they're not, if you understand the terms being discussed.
Eric H.
@Eric H. did you change the terms being discussed? Did Ethan asked how he could analyse a math expression? A parser is good for trying to evaluate it reading it from left to right but if you want to group math expression into object you need a way to split those groups. RegExp are good to find group. They can most definitely answer to the question "Matching math expression with regular expression?" with "Yes you can, and you can even go further.".
Guillaume Malartre
You cannot balance parens with a regular expression. It just cannot be done. You can do it with extenesions to regular expressions, but then they're not regular any more.
Eric H.
+1  A: 

This is tricky with one single regular expression, but quite easy using mixed regexp/procedural approach. The idea is to construct a regexp for the simple expression (without parenthesis) and then repeatedly replace ( simple-expression ) with some atomic string (e.g. identifier). If the final reduced expression matches the same `simple' pattern, the original expression is considered valid.

Illustration (in php).

function check_syntax($str) {

    // define the grammar
    $number = "\d+(\.\d+)?";
    $ident  = "[a-z]\w*";
    $atom   = "[+-]?($number|$ident)";
    $op     = "[+*/-]";
    $sexpr  = "$atom($op$atom)*"; // simple expression

    // step1. remove whitespace
    $str = preg_replace('~\s+~', '', $str);

    // step2. repeatedly replace parenthetic expressions with 'x'
    $par = "~\($sexpr\)~";
    while(preg_match($par, $str))
        $str = preg_replace($par, 'x', $str);

    // step3. no more parens, the string must be simple expression
    return preg_match("~^$sexpr$~", $str);
}


$tests = array(
    "a * b + c",
    "-a * (b / 1.50)",
    "(apple + (-0.5)) * (boy - 1)",
    "--a *+ b @ 1.5.0",
    "-a * b + 1)",
    "a) * (b + c) / (d",
);

foreach($tests as $t)
    echo $t, "=", check_syntax($t) ? "ok" : "nope", "\n";

The above only validates the syntax, but the same technique can be also used to construct a real parser.

stereofrog
+1 Good illustration. Minor quibble: I don't think you can remove whitespace so early, because it can turn an invalid expression into a valid one. For example: `f o o + 1` becomes `foo+1`.
FM
@FM: good point. This was of course, a bit lazy, in the real world, whitespace should be added to the grammar, like `expr = atom (ws* op ws* atom)*`
stereofrog
+1  A: 

Ok here's my version of parenthesis finding in ActionScript3, using this approach give a lot of traction to analyse the part before the parenthesis, inside the parenthesis and after the parenthis, if some parenthesis remains at the end you can raise a warning or refuse to send to a final eval function.

package {
import flash.display.Sprite;
import mx.utils.StringUtil;
public class Stackoverflow_As3RegexpExample extends Sprite
{
    private var tokenChain:String = "2+(3-4*(4/6))-9(82+-21)"
    //Constructor
    public function Stackoverflow_As3RegexpExample() {
        // remove the "\" that just escape the following "\" if you want to test outside of flash compiler.
        var getGroup:RegExp = new RegExp("((?:[^\\(\\)]+)?)   (?:\\()       (  (?:[^\\(\\)]+)? )    (?:\\))        ((?:[^\\(\\)]+)?)", "ix")   //removed g flag
        while (true) {
            tokenChain = replace(tokenChain,getGroup)
            if (tokenChain.search(getGroup) == -1) break; 
        }
        trace("cummulativeEvaluable="+cummulativeEvaluable)
    }
    private var cummulativeEvaluable:Array = new Array()
    protected function analyseGrammar(matchedSubstring:String, capturedMatch1:String, capturedMatch2:String,  capturedMatch3:String, index:int, str:String):String {
        trace("\nanalyseGrammar str:\t\t\t\t'"+str+"'")
        trace("analyseGrammar matchedSubstring:'"+matchedSubstring+"'")
        trace("analyseGrammar capturedMatchs:\t'"+capturedMatch1+"'  '("+capturedMatch2+")'   '"+capturedMatch3+"'")
        trace("analyseGrammar index:\t\t\t'"+index+"'") 
        var blank:String = buildBlank(matchedSubstring.length)
        cummulativeEvaluable.push(StringUtil.trim(matchedSubstring))
        // I could do soo much rigth here!
        return str.substr(0,index)+blank+str.substr(index+matchedSubstring.length,str.length-1)
    }
    private function replace(str:String,regExp:RegExp):String {
        var result:Object = regExp.exec(str)
        if (result)
            return analyseGrammar.apply(null,objectToArray(result)) 
        return str
    }
    private function objectToArray(value:Object):Array {
        var array:Array = new Array()
        var i:int = 0
        while (true) {
            if (value.hasOwnProperty(i.toString())) {
                array.push(value[i])
            } else {
                break;
            }
            i++
        }
        array.push(value.index)
        array.push(value.input)
        return array
    }
    protected function buildBlank(length:uint):String {
        var blank:String = ""
        while (blank.length != length)
            blank = blank+" "
        return blank
    }
}

}

It should trace this:

analyseGrammar str:             '2+(3-4*(4/6))-9(82+-21)'
analyseGrammar matchedSubstring:'3-4*(4/6)'
analyseGrammar capturedMatchs:  '3-4*'  '(4/6)'   ''
analyseGrammar index:           '3'

analyseGrammar str:             '2+(         )-9(82+-21)'
analyseGrammar matchedSubstring:'2+(         )-9'
analyseGrammar capturedMatchs:  '2+'  '(         )'   '-9'
analyseGrammar index:           '0'

analyseGrammar str:             '               (82+-21)'
analyseGrammar matchedSubstring:'               (82+-21)'
analyseGrammar capturedMatchs:  '               '  '(82+-21)'   ''
analyseGrammar index:           '0'
cummulativeEvaluable=3-4*(4/6),2+(         )-9,(82+-21)
Guillaume Malartre