views: 283 · answers: 4

Every undergraduate Intro to Compilers course reviews the commonly implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that, for any given k, each of those grammar classes is a subset of the next.

What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?

Two questions:

  1. What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
  2. In what ways, if any, do those constructions manifest as desirable language constructs?

There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.


edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:

When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
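To make the divergence risk concrete, here's a quick sketch in Python (my choice of tool; the patterns are invented for illustration):

    import re

    # Some "big and hairy" subexpression, e.g. an identifier.
    big = r'[A-Za-z_][A-Za-z0-9_]*'

    with_plus = re.compile(rf'(?:{big})+$')           # uses +
    doubled   = re.compile(rf'(?:{big})(?:{big})*$')  # + desugared to xx*

    # The two accept exactly the same strings...
    for s in ['foo', 'foo9', '9foo', '']:
        assert bool(with_plus.match(s)) == bool(doubled.match(s))
    # ...but the desugared form repeats `big`, and the two copies can
    # silently drift apart as the pattern is edited over time.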

Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?

A: 

The capabilities of a language are not limited by its syntax and grammar.

It's possible to define any language feature with an LL(k) grammar; it just might not be very readable to humans.

Ben S
Sure, LL(1) is sufficient for Turing completeness. But I'm specifically interested in syntactic constructions that are made possible by moving to a more powerful formal language class.
Ben Karel
A: 

Well, for one, left-recursive definitions are impossible in LL(k) grammars (as far as I know); I can't speak for other constructs. This doesn't make anything impossible to define, just a massive pain to express otherwise. For instance, putting together expressions is easy with a left-recursive rule (in pseudocode):

    rule expression = other rules
                    | expression '+' expression
                    | '(' expression ')';
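To sketch why this is a pain for LL specifically (Python stands in for a hand-written LL parser here; the grammar is a toy): recursive descent turns each rule into a function, and a left-recursive rule becomes a function whose first act is to call itself, so it loops forever without consuming any input. The standard rewrite turns the left recursion into iteration:

    # Left-recursive rule:   expr -> expr '+' term | term
    # Naive recursive descent would be:
    #     def expr(toks):
    #         left = expr(toks)   # recurses forever, consumes nothing
    #         ...
    # The usual LL-friendly rewrite:   expr -> term ('+' term)*
    def expr(toks):
        node = term(toks)
        while toks and toks[0] == '+':
            toks.pop(0)
            node = ('+', node, term(toks))  # rebuild left associativity by hand
        return node

    def term(toks):
        return toks.pop(0)  # toy terminal: a single token

    print(expr('1 + 2 + 3'.split()))  # ('+', ('+', '1', '2'), '3')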

As far as syntactically useful things that can be made with left recursion go, um, do simpler grammars count as syntactically useful?

RCIX
I agree that LL grammars do not express the structure of a language as directly as most people probably think about it. (I'd say an even bigger impedance mismatch than left recursion arises from lacking declarative operator precedence, and having to encode operator precedence in the grammar productions.) But that small awkwardness is a property of the grammar, not of the language defined by the grammar. Given a language, the "user" of the language doesn't care if the grammar has warts. And in practice, left factoring's not that painful, especially because ANTLRWorks has the refactoring built in.
Ben Karel
I tried that once, actually, and it just screwed up my grammar (didn't even get rid of the error)
RCIX
+5  A: 

Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.

My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.

To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
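To make that concrete (a toy sketch in Python; token handling is deliberately simplistic): both alternatives of stmt -> 'if' expr 'then' stmt | 'if' expr 'then' stmt 'else' stmt begin with the same unbounded prefix, so no fixed-k lookahead can choose between them up front. A hand-written parser sidesteps this by parsing the common prefix and then greedily grabbing an 'else' if one is present:

    def parse_stmt(toks):
        if toks and toks[0] == 'if':
            toks.pop(0)
            cond = toks.pop(0)              # toy one-token condition
            assert toks.pop(0) == 'then'
            then_branch = parse_stmt(toks)
            else_branch = None
            if toks and toks[0] == 'else':  # greedy: 'else' binds to nearest 'if'
                toks.pop(0)
                else_branch = parse_stmt(toks)
            return ('if', cond, then_branch, else_branch)
        return toks.pop(0)                  # toy "other statement": one token

    toks = 'if c1 then if c2 then s1 else s2'.split()
    print(parse_stmt(toks))  # the 'else' attaches to the inner 'if'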

But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
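A minimal sketch of that patch-up step, assuming subtraction was parsed with an LL-friendly right-recursive rule (so the raw tree leans the wrong way) and a post-pass rotates it back to left-associative; the node representation is invented:

    # Right-recursive parsing of 'a - b - c - d' yields (a - (b - (c - d))).
    # Rotating right-leaning nodes restores the intended left associativity.
    class Sub:
        def __init__(self, left, right):
            self.left, self.right = left, right
        def __repr__(self):
            return f'({self.left} - {self.right})'

    def reassociate(node):
        while isinstance(node, Sub) and isinstance(node.right, Sub):
            a, bc = node.left, node.right
            node = Sub(Sub(a, bc.left), bc.right)  # (a - (b - c)) => ((a - b) - c)
        return node

    tree = Sub('a', Sub('b', Sub('c', 'd')))  # what the LL parse naturally builds
    print(reassociate(tree))                  # (((a - b) - c) - d)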

Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)

In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.

John Clements
Do you have any other specific examples of the shortcomings of LL(k)? An answer of "all kinds of things" merely whets the appetite...
Ben Karel
Sure. The idea of an LL grammar is that it must decide which form it's parsing based on the first k tokens of the input. So, if I have two rules that start with the same 'k' tokens, an LL(k) parser can't distinguish them. Where is this going to come up in practice? Generally in attempted syntactic shortcuts. For instance: suppose I'm a big fan of C syntax, but I don't like the semicolons (or whitespace sensitivity). I might discover that if all my statements contain exactly one "statement keyword", then I may be able to formulate this as an LR grammar but not as an LL grammar.
John Clements
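A concrete (hypothetical) instance of that shared-prefix problem: in a C-like syntax, assignments and calls both start with an identifier, so an LL(1) parser can't pick a production on the first token. The usual fix is to left-factor, i.e. consume the common prefix and branch on what follows, as this Python sketch does:

    # stmt -> ID '=' expr  |  ID '(' arg ')'   -- shared ID prefix
    def parse_stmt(toks):
        name = toks.pop(0)      # common prefix: the identifier
        if toks[0] == '=':
            toks.pop(0)
            return ('assign', name, toks.pop(0))
        elif toks[0] == '(':
            toks.pop(0)
            arg = toks.pop(0)
            assert toks.pop(0) == ')'
            return ('call', name, arg)

    print(parse_stmt('x = 1'.split()))    # ('assign', 'x', '1')
    print(parse_stmt('f ( y )'.split()))  # ('call', 'f', 'y')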
Addendum: if all of this sounds a bit grotty, you're right. It is. Additional disclaimer: I'm actually an avid paren-head.
John Clements
Additional additional comment on your "+" example: the "+" of regular expressions is a horse of a very different color; this is what's called "syntactic sugar", and the key difference is that it doesn't increase expressiveness, because any use of the "+" operator can be replaced *in a purely local way* with an equivalent expression that doesn't use it. cf Felleisen "expressiveness".
John Clements
Sure, it doesn't increase expressiveness, but it does reduce the chance for programmer error. I agree that desugaring is at best tangentially related to my original question, but then if I could have given a "real" example involving context-free grammars, I wouldn't have asked the original question! ;-)
Ben Karel
John Clements (or anyone, really): Is there some LR grammar that can't be hacked into an LL grammar that accepts exactly the same texts, albeit perhaps with a wildly different parse tree? (Maybe I'm nit-picking, but by my reading you didn't quite say one way or the other.)
Jason Orendorff
+1  A: 

The difference between LL and LR is primarily in when the parser has to commit to a production: an LL parser must choose based on the next k tokens alone, while an LR parser can wait until it has seen an entire right-hand side. That is why people generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar with S as the start symbol:

A -> Ax | x 
B -> Ay
C -> Az
S -> B | C

For any fixed k, parsing a string like xxxxxxy defeats an LL(k) parser: the y or z that distinguishes B from C can sit arbitrarily far from the point where the production must be chosen, whereas an LR parser simply reduces the x's and decides at the end. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values, and most people no longer care.
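A toy illustration of that (hand-rolled in Python, not produced by any parser generator): a bottom-up recognizer can shift the x's and reduce them to A without ever committing to B or C, deciding only when the final y or z arrives:

    # Grammar:  S -> B | C,  B -> A y,  C -> A z,  A -> A x | x
    def recognize(tokens):
        stack = []
        for tok in tokens:
            stack.append(tok)                          # shift
            if stack[-1] == 'x':
                if len(stack) >= 2 and stack[-2] == 'A':
                    stack[-2:] = ['A']                 # A -> A x
                else:
                    stack[-1] = 'A'                    # A -> x
            elif stack[-1] in 'yz' and len(stack) >= 2 and stack[-2] == 'A':
                stack[-2:] = ['S']                     # A y -> B -> S,  A z -> C -> S
        return stack == ['S']

    assert recognize('xxxxxxy') and recognize('xz')
    assert not recognize('xxx') and not recognize('yx')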

I hope this is more or less in line with your question. Of course, Knuth showed that any deterministic context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation, not just recognition.

As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.

This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left to right. But consider the pattern below:

. . . . . | . . . . .

I rather expect that with short patterns such as this one, people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather process the pattern in parallel, or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that an LL or LR parser employs.

Furthermore, given that an LR(1) grammar exists for any deterministic context-free language, it seems clear that simply recognizing a string is not the same as "understanding" it.

Rehno Lindeque
I think that saying that either LR or LL is more natural is rather unfounded. Personally, I find LL much more natural, precisely because I "read from left to right". So it's probably very subjective overall.
Pavel Minaev
I didn't mean to imply that LR parsing is more natural than LL from a human perspective. For the first part, I merely intended to demonstrate a grammar (not a language) that cannot be parsed by LL for any fixed k. However, Knuth did actually suggest that LR might be similar to human parsing. Personally I don't really agree: if I really had to speculate, I'd say it's some fuzzy hybrid that involves both top-down and bottom-up parsing.
Rehno Lindeque