views:

100

answers:

3

To preface this, my knowledge of this kind of stuff is puny.

Anyways, I've been developing a context-free grammar to describe the structure of alegbraic expressions so I can teach myself how the CYK parsing algorithm works. I understand how such a structure can work with only infix algebraic expressions, but I cannot understand how to develop a grammar that can handle both the unary and binary definitions of the "-" operator.

For reference, here's the grammar I've written (where S is the start symbol) in CNF:

S -> x
A -> O S
S -> L B
B -> S R
S -> K S
O -> +
O -> -
O -> *
O -> /
O -> ^
K -> -
L -> (
R -> )

The problem is that how can the CYK parsing algorithm know ahead of time whether to decide between S -> K S and A -> O S when it encounters the "-" operator? Is such a grammar context-free anymore? And most importantly, since programming languages can handle languages with both the binary and unary minus sign, how should I reasonably parse this?

+2  A: 

The grammar is ambiguous, and the parser cannot decide which case to take.

You should probably use a grammar like the following:

S -> EXPR
EXPR -> (EXPR)
EXPR -> - EXPR
EXPR -> EXPR + EXPR
EXPR -> EXPR - EXPR
// etc...
kiamlaluno
What are you studying? It appears interesting.
daedalus0x1a4
The problem with such a grammar is that it isn't in Chomsky normal form, and (correct me if I'm wrong), that makes it a lot harder to get it to work with a CYK parser. In addition, I'm not entirely sure how to convert any CFG into a CNF grammar.
Tom O
+4  A: 

This seems like a problem related to finite state automata and I don't remember everything from my coursework, but I wrote a CYK parser in OCaml, so I'll go ahead and take a shot :)

If you're trying to parse an expression like 3- -4 for example, you would have your S -> K S rule consume the -4 and then your A -> O S rule would absorb the - -4. This would eventually work up to the top-most S production rule. You should be careful with the grammar you're using though, since the A production rule you listed cannot be reached from S and you should probably have a S -> S O S rule of some sort.

The way that CYK parsing algorithms work is through backtracking, not through the "knowing ahead of time" that you mentioned in your question. What your CYK algorithm should do is to parse the -4 as a S -> K S rule and then it would try to absorb the second - with the S -> K S rule again because this production rule allows for an arbitrarily long chain of unary -. But once your algorithm realizes that it's stuck with the intermediate parse 3 S, it realizes that it has no production symbols that it can use to parse this. Once it realizes that this is no longer parseable, it will go back and instead try to parse the - as an S -> O S rule instead and continue on its merry way.

This means that your grammar remains context-free since a context-sensitive grammar means that you have terminals on the left side of the production rules, so you're good in that respect. HTH!

SHC
Thanks, this greatly help with solving the primary problem of how to parse both the unary and binary definitions of the minus operator. :)
Tom O
A: 

Grammars based on algebraic expressions are rather difficult to disambiguate. Here are some examples of problems which need to be addressed:

a+b+c naturally creates two parse trees. To resolve this, you need to resolve the ambiguity of the associativity of +. You may wish to let a left-to-right parsing strategy take care of this for you, but careful: exponentiation should probably associate right-to-left.

a+b*c naturally creates two parse trees. To fix this problem, you need to deal with operator precedence.

implicit multiplication (a+bc), if it is allowed, creates all sorts of nightmares, mostly at tokenization.

unary subtraction is problematic, as you mention.

If we want to solve these problems, but still have a fast-parsing grammar specialized for algebra, one approach is to have various "levels" of EXPR, one for each level of binding required by precedence levels. For example,

TERM -> (S)
EXPO -> TERM ^ EXPO
PROD -> PROD * EXPO
PROD -> PROD / EXPO
PROD -> -PROD
SUM -> SUM + PROD
SUM -> SUM - PROD
S -> SUM

This requires that you also allow "promotion" of types: SUM -> PROD, PROD -> EXP, EXP -> TERM, etc, so that things can terminate.

Hope this helps!

Josephine