views:

548

answers:

7

I would like to parse boolean expressions in PHP. As in:

A and B or C and (D or F or not G)

The terms can be considered simple identifiers. They will have a little structure, but the parser doesn't need to worry about that. It should just recognize the keywords and or not ( ). Everything else is a term.

I remember we wrote simple arithmetic expression evaluators at school, but I don't remember how it was done anymore. Nor do I know what keywords to look for in Google/SO.

A ready made library would be nice, but as I remember the algorithm was pretty simple so it might be fun and educational to re-implement it myself.

+1  A: 

I'd go with Pratt parser. It's almost like recursive descent but smarter :) A decent explanation by Douglas Crockford (of JSLint fame) here.

Rafał Dowgird
Wouldn't it be a bit of overkill for something as simple as an expression?
Vilx-
+2  A: 

Why not jsut use the PHP parser?

 $terms=array('and','or','not','A','B','C','D'...);
 $values=array('*','+','!',1,1,0,0,1....);

 $expression="A and B or C and (D or F or not G)";
 $expression=preg_replace($terms, $values,$expression);
 $expression=preg_replace('^(+|-|!|1|0)','',$expression);
 $result=eval($expression);

Actually, that 2nd regex is wrong (and only required if you need to prevent any code injection) - but you get the idea.

C.

symcbean
Well... it's a possibility I guess... Although it will be a bit trickier because the terms will have special format too - like #23@45
Vilx-
A: 

You could use an LR parser to build a parse tree and then evaluate the tree to obtain the result. A detailed description including examples can be found in Wikipedia. If you haven't coded it yourself already I will write a small example tonight.

ahans
Talk about overkill....
Vilx-
Doing this just for logical expressions like that wouldn't be overkill. You will have to use some sort of parsing. But I have to correct the parse tree thing though: you won't have to build that up explicitly but can evaluate the result on the fly. Wait for my example and you'll see that it's really easy. ;-) However, maybe just using PHP's parser via eval() is the most practical solution for you.
ahans
Implementing the LR parser won't happen today, but now I'm interested how this solution would look like compared to the shunting yard ...
ahans
+2  A: 

Dijkstra's shunting yard algorithm is the traditional one for going from infix to postfix/graph.

plinth
Ahh, this was the one we used in school! Though I just realized - how does it work with the unary NOT operator?
Vilx-
Not much different than binary. You're converting to RPN, so A OR NOT B should become A B NOT OR in rpn.
plinth
A: 

The simplest way is to use regexes that converts your expression into an expression in php syntax and then use eval, as suggested by symcbean. But I'm not sure if you would want to use it in production code.

The other way is to code your own simple recursive descent parser. It isn't as hard as it might sound. For a simple grammar such yours (boolean expressions), you can easily code one from scratch. You can also use a parser generator similar to ANTLR for php, probably searching for a php parser generator would turn up something.

MAK
A: 

I've implemented the shunting yard algorithm as suggested by plinth. However, this algorithm just gives you the postfix notation, aka reverse Polish notation (RNP). You still have to evaluate it, but that's quite easy once you have the expression in RNP (described for instance here).

The code below might not be good PHP style, my PHP knowledge is somewhat limited. It should be enough to get the idea though.

$operators = array("and", "or", "not");
$num_operands = array("and" => 2, "or" => 2, "not" => 1);
$parenthesis  = array("(", ")");

function is_operator($token) {
    global $operators;
    return in_array($token, $operators);
}

function is_right_parenthesis($token) {
    global $parenthesis;
    return $token == $parenthesis[1];
}

function is_left_parenthesis($token) {
    global $parenthesis;
    return $token == $parenthesis[0];
}

function is_parenthesis($token) {
    return is_right_parenthesis($token) || is_left_parenthesis($token);
}

// check whether the precedence if $a is less than or equal to that of $b
function is_precedence_less_or_equal($a, $b) {
    // "not" always comes first
    if ($b == "not")
        return true;

    if ($a == "not")
        return false;

    if ($a == "or" and $b == "and")
        return true;

    if ($a == "and" and $b == "or")
        return false;

    // otherwise they're equal
    return true;
}


function shunting_yard($input_tokens) {
    $stack = array();
    $output_queue = array();

    foreach ($input_tokens as $token) {
        if (is_operator($token)) {
            while (is_operator($stack[count($stack)-1]) && is_precedence_less_or_equal($token, $stack[count($stack)-1])) {
                    $o2 = array_pop($stack);
                    array_push($output_queue, $o2);
            }
            array_push($stack, $token);

        } else if (is_parenthesis($token)) {
            if (is_left_parenthesis($token)) {
                array_push($stack, $token);
            } else {
                while (!is_left_parenthesis($stack[count($stack)-1]) && count($stack) > 0) {
                    array_push($output_queue, array_pop($stack));
                }
                if (count($stack) == 0) {
                    echo ("parse error");
                    die();
                }
                $lp = array_pop($stack);
            }
        } else {
            array_push($output_queue, $token);  
        }
    }

    while (count($stack) > 0) {
        $op = array_pop($stack);
        if (is_parenthesis($op))
            die("mismatched parenthesis");
        array_push($output_queue, $op);
    }

    return $output_queue;
}

function str2bool($s) {
    if ($s == "true")
        return true;
    if ($s == "false")
        return false;
    die('$s doesn\'t contain valid boolean string: '.$s.'\n');
}

function apply_operator($operator, $a, $b) {
    if (is_string($a))
        $a = str2bool($a);
    if (!is_null($b) and is_string($b))
        $b = str2bool($b);

    if ($operator == "and")
        return $a and $b;
    else if ($operator == "or")
        return $a or $b;
    else if ($operator == "not")
        return ! $a;
    else die("unknown operator `$function'");
}

function get_num_operands($operator) {
    global $num_operands;
    return $num_operands[$operator];
}

function is_unary($operator) {
    return get_num_operands($operator) == 1;
}

function is_binary($operator) {
    return get_num_operands($operator) == 2;
}

function eval_rpn($tokens) {
    $stack = array();
    foreach ($tokens as $t) {
        if (is_operator($t)) {
            if (is_unary($t)) {
                $o1 = array_pop($stack);
                $r = apply_operator($t, $o1, null);
                array_push($stack, $r);
            } else { // binary
                $o1 = array_pop($stack);
                $o2 = array_pop($stack);
                $r = apply_operator($t, $o1, $o2);
                array_push($stack, $r);
            }
        } else { // operand
            array_push($stack, $t);
        }
    }

    if (count($stack) != 1)
        die("invalid token array");

    return $stack[0];
}

// $input = array("A", "and", "B", "or", "C", "and", "(", "D", "or", "F", "or", "not", "G", ")");
$input = array("false", "and", "true", "or", "true", "and", "(", "false", "or", "false", "or", "not", "true", ")");
$tokens = shunting_yard($input);
$result = eval_rpn($tokens);
foreach($input as $t)
    echo $t." ";
echo "==> ".($result ? "true" : "false")."\n";
ahans
+5  A: 

Recursive descent parsers are fun to write and easy to read. The first step is to write your grammar out.

Maybe this is the grammar you want.

expr        = and_expr ('or' and_expr)*
and_expr    = not_expr ('and' not_expr)*
not_expr    = simple_expr | 'not' not_expr
simple_expr = term | '(' expr ')'

Turning this into a recursive descent parser is super easy. Just write one function per nonterminal.

def expr():
    x = and_expr()
    while peek() == 'or':
        consume('or')
        y = and_expr()
        x = OR(x, y)
    return x

def and_expr():
    x = not_expr()
    while peek() == 'and':
        consume('and')
        y = not_expr()
        x = AND(x, y)
    return x

def not_expr():
    if peek() == 'not':
        consume('not')
        x = not_expr()
        return NOT(x)
    else:
        return simple_expr()

def simple_expr():
    t = peek()
    if t == '(':
        consume('(')
        result = expr()
        consume(')')
        return result
    elif is_term(t):
        consume(t)
        return TERM(t)
    else:
        raise SyntaxError("expected term or (")

This isn't complete. You have to provide a little more code:

  • Input functions. consume, peek, and is_term are functions you provide. They'll be easy to implement using regular expressions. consume(s) reads the next token of input and throws an error if it doesn't match s. peek() simply returns a peek at the next token without consuming it. is_term(s) returns true if s is a term.

  • Output functions. OR, AND, NOT, and TERM are called each time a piece of the expression is successfully parsed. They can do whatever you want.

  • Wrapper function. Instead of just calling expr directly, you'll want to write a little wrapper function that initializes the variables used by consume and peek, then calls expr, and finally checks to make sure there's no leftover input that didn't get consumed.

Even with all this, it's still a tiny amount of code. In Python, the complete program is 84 lines, and that includes a few tests.

Jason Orendorff
+1 for fun! I've always considered it the best heuristic towards good programming.
Edmund
Vilx-
This particular one only looks ahead one symbol, but if you need a little extra lookahead, you can certainly add a `peek_ahead(n)` function and go nuts. This is actually a strength of recursive descent: hacking it to do something ad hoc (because the language you're parsing is a little screwy) is usually quite straightforward.
Jason Orendorff