views: 307

answers: 7
Yesterday I asked about C++ context sensitivity, see here. Among many excellent answers, here is the accepted one, by dmckee.

However, I still think there's something to be said about this (maybe some terminological confusion?). The question amounts to: what part of compilation deals with the ambiguity?

To clarify my terminology: a CFG is a grammar that has only one non-terminal symbol on the left-hand side of each rule (e.g. A -> zC), while a CSG is one that may have a terminal (in addition to a non-terminal) on the left-hand side (e.g. aAv -> QT), where uppercase letters are non-terminals and lowercase letters are terminals.

Is any representation like the latter in the grammar parsing C++ source code?

Thank you, and sorry to push the issue.

A: 

I'd dare to say that any context-sensitive issue should be resolved during semantic analysis.

Anzurio
A: 

Since no one else is stepping up (and admitting that I am very sketchy on what "real" compilers do):

The ambiguity surrounding a statement like

B = A();

is resolved by consulting the symbol table to find out the types of any non-keyword, non-operator tokens in the statement (here B and A), and attempting to form an acceptable assignment out of them. In the event that no such arrangement can be found, an error is issued.

This is presumably done during AST formation early in the parsing process, but after the completion of lexing.
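For illustration (a minimal sketch of mine, not part of the original answer; the type T and the definitions are made up), the same token sequence means different things depending on what the symbol table records for A:

struct T { int v; };

T A() { T t; t.v = 1; return t; }   // here A names a function, so A() is a call

// If instead we had "typedef T A;", the same statement below would
// value-initialize a temporary object of type A and copy it into B.

int main() {
    T B;
    B = A();   // the parser/semantic analysis asks the symbol table which case applies
    return B.v;
}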

dmckee
right, but if so, the grammar *is* context-free, and some processing is done before or even during the construction of the AST... Maybe I should stop speculating and go to the grammar of C++ itself. I have the feeling that I will find a CFG with some "in-between" processing. Thanks again, D.
Dervin Thunk
That is how compilers work. The parser (i.e. syntactic analysis) uses a context-free grammar, but a lot of checks are made using the knowledge available from the symbol table.
txwikinger
A: 

what part of compilation deals with the ambiguity?

There should be no ambiguity after building the syntax tree, so AFAIK the final syntax tree should be ready for translation. In short, the syntax analyzer should resolve the context-sensitive grammar (AKA ambiguous grammar) during the derivation process (AKA syntax analysis).

AraK
Context sensitivity and ambiguity aren't the same thing.
AProgrammer
A: 

You seem to assume that a compiler is built of phases that are strictly based on the grammar. Nothing could be further from the truth. The grammar is used to parse some bits of the input, then all sorts of heuristics are used to parse other bits, then more grammar parsing is done, and so on. The ideal compiler, described lovingly in textbooks, does not exist.

A simple example is a recursive-descent compiler that uses a table-driven parser for arithmetic expressions. Or indeed, vice versa.
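For illustration only (a sketch of mine, not from the answer): a recursive-descent statement parser that delegates arithmetic expressions to a table-driven, operator-precedence routine, showing how the two styles mix:

#include <cctype>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

struct Lexer {
    std::string src;
    std::size_t pos;
    char peek() { return pos < src.size() ? src[pos] : '\0'; }
    char get()  { return pos < src.size() ? src[pos++] : '\0'; }
    void skip() { while (std::isspace(peek())) get(); }
};

long parsePrimary(Lexer& lx) {
    lx.skip();
    long v = 0;
    while (std::isdigit(lx.peek())) v = v * 10 + (lx.get() - '0');
    return v;
}

// The "table-driven" part: operator precedences live in a table instead of
// being hard-coded as one function per precedence level.
long parseExpr(Lexer& lx, int minPrec) {
    static std::map<char, int> prec;
    if (prec.empty()) { prec['+'] = 10; prec['-'] = 10; prec['*'] = 20; prec['/'] = 20; }
    long lhs = parsePrimary(lx);
    for (;;) {
        lx.skip();
        std::map<char, int>::const_iterator it = prec.find(lx.peek());
        if (it == prec.end() || it->second < minPrec) return lhs;
        char op = lx.get();
        long rhs = parseExpr(lx, it->second + 1);   // +1 gives left associativity
        if (op == '+') lhs += rhs;
        else if (op == '-') lhs -= rhs;
        else if (op == '*') lhs *= rhs;
        else lhs /= rhs;
    }
}

// The "recursive-descent" part: one function per syntactic construct,
// delegating to parseExpr() for the arithmetic.
void parseStatement(Lexer& lx) {
    long value = parseExpr(lx, 0);
    lx.skip();
    if (lx.get() != ';') std::cerr << "expected ';'" << std::endl;
    std::cout << "statement value: " << value << std::endl;
}

int main() {
    Lexer lx = { "1 + 2 * 3;", 0 };
    parseStatement(lx);   // prints: statement value: 7
}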

anon
Sorry, Neil, what are you reacting against? My question, dmckee's answer, or my note to dmckee's answer? Just clarifying.
Dervin Thunk
+1  A: 

First there is the difference between the language and the grammar. A language is a set of strings. A grammar is a way of describing a set of strings (one often says that a grammar "generates" the strings). A given language may be described by several grammars.

The most well-known kinds of grammars are the production-based ones. These were classified by Chomsky into:

  • unrestricted grammars, where there can be anything on either side of a production

  • monotonic grammars, where the left-hand side is at most as long as the right-hand side

  • context-sensitive grammars, where only one non-terminal is expanded, and only within a surrounding context that must be preserved

  • context-free grammars, where the left-hand side of each production consists of only one non-terminal

  • regular grammars, where the left-hand side of each production consists of only one non-terminal and the right-hand side may contain at most one non-terminal, as its last element.

Monotonic and context-sensitive grammars are also called type 1 grammars. They are able to generate the same languages. They are less powerful than type 0 grammars. AFAIK, while I've seen proofs that there are languages which have a type 0 grammar but no type 1 grammar, I know of no example.

Context-free grammars are also called type 2 grammars. They are less powerful than type 1 grammars. The standard example of a language for which there is no type 2 grammar but a type 1 grammar is the set of strings consisting of an equal number of a's, b's and c's, with all the a's before the b's and all the b's before the c's.
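For illustration (a standard textbook grammar, given in monotonic rather than strict context-sensitive form), one grammar for that language, in the notation used above, is:

S -> aSBC | aBC
CB -> BC
aB -> ab
bB -> bb
bC -> bc
cC -> cc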

Regular grammars are also called type 3 grammars. They are less powerful than type 2 grammars. The standard example of a language for which there is no type 3 grammar but a type 2 grammar is the set of strings with correctly matching parentheses.
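For illustration, one context-free grammar for correctly matching parentheses is:

S -> (S)S | ε

where ε is the empty string.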

Ambiguity in grammars is something outside that hierarchy. A grammar is ambiguous if a given string can be generated in several ways. There are unambiguous type 1 grammars, and there are ambiguous type 3 grammars.

Then there are other kinds of grammars which aren't part of the Chomsky classification (two-level grammars, attribute grammars, tree-adjoining grammars, ...) even though they are based on productions. Some of these are even able to describe the semantics of programming languages.

Then there are parsing algorithms. These are often based on CFGs and impose more restrictions to get better parsing speed (parsing a CSG needs exponential time, a CFG needs cubic time, and the common algorithms need only linear time). Those restrictions introduce other classes of grammars.

CSGs and monotonic grammars are in fact of little use for describing or compiling a programming language: their global behaviour isn't apparent and has to be synthesised from local properties, so they are difficult to understand; attaching semantics to productions is problematic; parsing them is costly -- in general exponential -- and error handling is difficult. The non-Chomsky grammars were introduced to solve these issues.

Back to C++. The standard describes the C++ language with a context-free grammar but

  • there are ambiguities (the famous "most vexing parse"). So a compiler has to recognize the ambiguities and use the right interpretation (i.e. C x(); is a function declaration, not an object definition; see the first sketch after this list).

  • the grammar is not LR(1) (one of the most well-known subsets of CFGs for which a linear parsing algorithm exists). Other algorithms (potentially more costly in time or space) are used, either based on a more general theory or by tweaking linear ones to adapt them to C++ rules. Simplifying the grammar and rejecting the incorrectly accepted programs in the semantic analysis is also a possibility.

  • the correspondence between strings of characters and terminals is modified (mainly by type and template declarations; the need to take that into account in template definitions has been solved with the use of typename and template for dependent names). This is handled by having the lexing phase query the symbol table so that a given string will give one terminal or another depending on the context (see the second sketch after this list).

  • there are additional constraints (the need to declare some identifiers, type checking, ...) described in a more or less formal variant of English. This is usually considered semantics, even if some more powerful grammar descriptions could handle it.
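Two minimal sketches of mine (not part of the answer above) to make the first and third points concrete. First, the "most vexing parse":

struct C { int get() const { return 0; } };

int main() {
    C x();       // declares a function x taking no arguments and returning C,
                 // not a default-constructed object of type C
    // x.get();  // error: x is not an object
    C y = C();   // this form does produce an object
    return y.get();
}

And the dependent-name issue, where the keyword typename tells the parser that a name refers to a type (the container and function names here are made up):

#include <vector>

template <typename T>
int firstElement(T& container) {
    // Without 'typename' the parser would have to assume T::iterator is not
    // a type, because whether it is one depends on what T turns out to be.
    typename T::iterator it = container.begin();
    return *it;
}

int main() {
    std::vector<int> v(3, 42);
    return firstElement(v);   // returns 42
}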

AProgrammer
+3  A: 

No C++ front end (parser, name/type resolver) that I know of (including the one we built) implements a context-sensitive parser using CSG grammar rules as you defined them. Pretty much all of them operate, explicitly or implicitly, with a context-free grammar which still has ambiguities.

Many of them use a combination of top-down parsing and interwoven collection of type information to distinguish between the ambiguous cases.

Weaving the parsing and type collection together makes building a parser this way really messy and hard, producing the folk theorem "C++ is hard to parse".

Like many such "theorems", it isn't true, unless you also insist on parsing with one hand tied behind your back (e.g., recursive descent, LL(k), LALR [e.g., YACC]). If you use a parsing technology that can handle ambiguities, the C++ grammar is actually not so hard. We (and others, such as the Elsa C++ parser) use GLR parsing technology for this reason. (We go a bit further and capture macro definitions, uses and preprocessor conditionals in the C++ grammar because we are interested in transforming the code before the preprocessor has ruined it; normally preprocessor directives are treated completely separately by an independent preprocessor.)

Elsa, I think, still interleaves the ambiguity resolution into the parsing process, but because the parsing is so clean this is easier to do. Our front end builds an AST with ambiguity nodes, and after the tree is built, we walk it using an attribute-grammar evaluator to collect names and types and to eliminate those branches of the ambiguities that are type-inconsistent. The latter is a really beautiful scheme, and completely decouples parsing and name resolution.
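A toy sketch of mine (assumptions only, not the front end described above) of that idea: the parser keeps both readings of an ambiguous phrase such as "A * B;" under an ambiguity node, and a later pass that can consult the symbol table keeps only the type-consistent reading:

#include <iostream>
#include <string>
#include <vector>

struct Node {
    std::string kind;               // "PtrDecl" or "MulExpr" here
    std::vector<Node> alternatives; // non-empty => an ambiguity node
};

// Stand-in for a real symbol-table lookup.
bool isTypeName(const std::string& name) { return name == "A"; }

// Keep the alternative consistent with what the symbol table says about the
// leading identifier; a real front end would diagnose the case where none fits.
Node resolve(const Node& ambiguous, const std::string& leadingName) {
    for (std::vector<Node>::const_iterator it = ambiguous.alternatives.begin();
         it != ambiguous.alternatives.end(); ++it) {
        bool needsType = (it->kind == "PtrDecl");
        if (needsType == isTypeName(leadingName)) return *it;
    }
    return ambiguous;   // unresolved: leave the ambiguity node in place
}

int main() {
    Node ptrDecl;  ptrDecl.kind = "PtrDecl";   // "A is a type: declaration of pointer B"
    Node mulExpr;  mulExpr.kind = "MulExpr";   // "A is a value: expression A times B"

    Node amb;
    amb.kind = "Ambiguous";
    amb.alternatives.push_back(ptrDecl);
    amb.alternatives.push_back(mulExpr);

    std::cout << resolve(amb, "A").kind << std::endl;   // prints "PtrDecl"
}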

What is hard is actually doing the name resolution. C++ has a pretty arcane scheme for looking things up; it's spread across the 600 pages of the standard, and it is bent in various ways by various dialects. Keeping name resolution separate from parsing makes this ugliness more manageable, but the folk theorem should really be "C++ is hard to name-resolve".

Ira Baxter
A: 

@Ira Baxter said: "No C++ front end (parser, name/type resolver) that I know of (including the one we built) implements a context sensitive parser using CSG grammar rules as you defined it."

There is a context-sensitive grammar in the Meta-S distribution that deals with a number of C++ parsing issues "in the grammar" rather than at any other phase of the parse. For example -- forward references to member data and member functions made from inline member function bodies are dealt with during the parsing phase alone.

Traditional parsers deal with many CS issues in the abstract-syntax-tree pass or at other phases of the compilation.

So my answer is: in "most" systems, ambiguity is dealt with after the parse. In at least one system, much of it is dealt with in the parse proper. Ergo, to "Is any representation like the latter in the grammar parsing C++ source code?" the answer is --

Yes. At least in Meta-S, type 0 / CS parsing is possible in the grammar.

Quinn Tyler Jackson
Code example of what I meant:

class Foo {
public:
    void foo(void) { bar(); }
private:
    void bar(void) { ++m_i; }
    int m_i;
};

Both bar() and m_i are referenced before they are declared above, and yet this is legal C++. This can and has been dealt with by a parser alone.
Quinn Tyler Jackson