views:

1043

answers:

13

How much time would it take to write a C++ compiler using lex/yacc?

Where can I get started with it?

+3  A: 

Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table driven tool. And thirdly, to answer your question directly - lots. If you feel you need to write one, take a look at ANTLR - not my favourite tool, but there are already C++ parsers for it.

anon
IMHO it's Adobes problem that they picked a name for their product that is already widely used.
Nils Pipenbrinck
Well, on SO it our problem. I suspect the number of people using Adobe Flex (not me, I hasten to add) greatly exceeds the users of the flex tool - who's name is not copyrighted or trademarked, as far as I know.
anon
@Nils - I agree, but the discussion on Meta shows that the consensus is for the new technology that will be gone in 5 years rather than the tried-and-true niche program that's been around nearly forever. Meta discussion around it (started by me. I'm famous!): http://meta.stackoverflow.com/questions/23959/whats-this-about-labelled-tags
Chris Lutz
+2  A: 

Lex,yacc will not be enough. You need a linker, assembler too.., c preprocessor. It depends on how you do it. How much pre-made components do you plan to use? You need to get the description of the syntax and its token from somewhere.

For example, if you use LLVM, you can proceed faster. It already provides a lot of tools, assembler, linker, optimiser.... You can get a c preprocessor from boost project.. You need to create a test suite to test your compiler automatically.

It can take a year if you work on it each day or much less you have more talent and motivation.

Aftershock
+1 for mentioning LLVM. I use it for my back end. Great stuff.
Richard Pennington
A compiler does not inately need a linker, assembler or preprocessor. I once wrote a small C compiler that didn't need either.
anon
A: 

Recursive decent is a good choice to parse C++. GCC and clang use it.

The Elsa parser (and my ellcc compiler) use the Elkhound GLR compiler generator.

In either case, writing a C++ compiler is a BIG job.

Richard Pennington
+14  A: 

There are many parsing rules that cannot be parsed by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally sometimes the interpretation of tokens requires input from the parser, particularly in C++0x. The handling of the character sequence >> for example is crucially dependent on parsing context.

Those two tools are very poor choices for parsing C++ and you would have to put in a lot of special cases that escaped the basic framework those tools rely on in order to correctly parse C++. It would take you a long time, and even then your parser would likely have weird bugs.

yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.

C++ templates are no good for handling strings, even constant ones (though this may be fixed in C++0x, I haven't researched carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.

Omnifarious
If I am not mistaken, gcc switched to using a recursive descent parser sometime late in the 3.x series.
Omnifarious
gcc uses lex and yacc for the C++ front-end. For all the ambiguities you mention there are explicit commands to resolve the conflicts. Personally I doubt there is a better framework for parsing C++ out there. __BUT__ writing a C++ lexer/parser without lots of compiler experience is a non starter for a single developer using lex/yacc (its just to large and complicated).
Martin York
@Martin York, in fact the bison/yacc parser was replaced with a recursive descent parser in gcc-3.4 - http://gcc.gnu.org/gcc-3.4/changes.html
Omnifarious
Recursive descent parsers are also much, much easier to understand. In fact, if you're interested in language development, I'd probably recommend starting with hand-generating a recursive descent parser for a relatively simple grammar.
kyoryu
@Martin, yacc is an interesting tool but the bigger the project, the more work it is to get yacc to DTRT. For a simple problem, every approach is easy, including yacc. For a tough problem, like C++, the choices are: use something more modern like a PEG tool, fight a lot with yacc and stir in some strange hacks, or just type in a straightforward RD parser. Personally, I would rather write code (RD) than bash my head against yacc, but I've done it both ways...never for C++ tho
DigitalRoss
@kyoryu: Recursive descent parsers are NOT that much easier to understand than the pure grammar, especially for an artifact of the scale of C++. You really want a parser generator driven from the language definition in terms of BNF rules. The people that say C++ is hard to parse are those that use YACC (and variants) and those that tangle name/type resolution with parsing. GLR parser generators allow you to build a very good parser using the BNF rules, and isolate name/type resolution; this separate makes each task a lot easier (although not easy). See my answer here.
Ira Baxter
@Ira: I was comparing recursive descent parsers to LALR parsers. Also, with parsers, writing at least a trivial one by hand can give you insight into what a parser generator is doing. I've also argued against entering the world of parsers/compilers with C++, just because (as you rightly point out) it is an incredibly complex language.
kyoryu
+1  A: 

A C++ compiler is very complicated. To implement enough of C++ to be compatible with most C++ code out there would take several developers a couple of years full time. clang is a compiler project being funded by Apple to develop a new compiler for C, C++, and Objective-C, with several full-time developers, and the C++ support is still very far from being complete after a couple of years of development.

Brian Campbell
+6  A: 

A long time, and lex and yacc won't help

If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK it may take longer to use yacc, as it's not really quite powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser.

I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.

For that matter, the entire parser is possibly not the major part of your job, depending on exactly what goals you have for the code generator.

DigitalRoss
I completely disagee. As do the gcc team. Where the C++ front-end is lex and yacc.
Martin York
Martin: it's not. "the bison/yacc parser was replaced with a recursive descent parser in gcc-3.4"
Nicolás
Martin: it's certainly possibly to use yacc, and gcc has done it both with and without. Ruby has a complex grammar and the main implementation does use yacc. I've written parsers both ways, certainly there are no easy "always do it this way" answers, I just think it's important to realize that the parser will be roughly the same amount of effort either way. The real problem with yacc is that while easy stuff is really easy, you can also get stuck on hard-to-understand bugs. With RD you just fix the code.
DigitalRoss
I think the main point is not whether or not flex/yacc are good tools, but to point out that they're a pretty small portion of the overall problem. Great, you've parsed a file into some intermediate representation (AST/whatever) - *now what?*
kyoryu
+3  A: 

This is a non-trivial problem, and would quite a lot of time to do correctly. For one thing, the grammar for C++ is not completely parseable by a LALR parser such as yacc. You can do subsets of the language, but getting the entire language specification correct is tricky.

You're not the first person to think that this is fun. Here's a nice blog-style article on the topic: Parsing C++

Here's an important quote from the article:

"After lots of investigation, I decided that writing a parser/analysis-tool for C++ is sufficiently difficult that it's beyond what I want to do as a hobby."

The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:

James Thompson
+8  A: 

It will probably take you years, and you'll probably switch to some other parser generator in the process.

Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).

Existing C++ Front Ends:

  1. The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.

  2. GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.

  3. Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.

  4. Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.

tgamblin
gcc_xml parses everything except the code, so it's not useful for a compiler. You only get function and type declarations.
Nicolás
+8  A: 

It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.

Either invent a trivial toy language of your own, or do something modeled on something much smaller and simpler. I saw a lua parser where the grammar definition was about a page long. That'd be much more reasonable as a starting point.

kyoryu
A: 

Unless you have already written several other compilers; C++ is not a language you even want to start writing a compiler from scratch for, the language has a lot of places were the meaning requires a lot of context before the situation can be disambiguated.

Even if you have lots of experience writing compilers you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the backend to generate code is yet another specialized task (though you could steal the gcc backend).

If you do a google for "C++ grammars" there are a couple around to get you started.

C++ LEX  Tokens:   http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammer:  http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
                   http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y
Martin York
A: 

Well, what do you mean by write a compiler?

I doubt any one guy has made a true C++ compiler that took it down all the way to assembly code, but I have used lex and yacc to make a C compiler and I have done it without.

Using both you can make a compiler that leaves out the semantics in a couple days, but figuring out how to use them can take weeks or months easily. Figuring out how to make a compiler at all will take weeks or months no matter what, but the figure I remember is once you know how it works it took a few days with lex and yacc and a few weeks without but the second had better results and fewer bugs so really it's questionable whether they are worth using at all.

The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.

With C++ the big issue is templates, but there's so many little issues and rules I can't imagine someone ever wanting to do this. Even if you DO finish, the problem is you won't necessarily have binary compatibility ie be able to be recognized as a runnable program by a linker or the OS because there's more to it than just C++ and its hard to pin down standard but there's also yet more standards to worry about which are even less widely available.

Charles Eli Cheese
A: 

As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the orginal GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer).

Some say recursive descent parsers are best, because Bjarne Stroustrop said so. Our experience is the GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including Microsoft and GCC dialects) to carry out program analyses and massive source code transformation.

But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You need to also build symbol tables ("what does this identifier mean in this context?") and to do that you need to encode essentially most of several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.

But then you have the rest of the compiler to consider:

  • Preprocessor
  • AST construction
  • Semantic analysis and type checking
  • Control, Data flow, and pointer analysis
  • Basic code generation
  • Optimizations
  • Register allocation
  • Final Code Generation
  • Debugging support

I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.

Expect building a C++ compiler to take you years.

(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).

Ira Baxter
Nice if you can afford this, Ira, are you affilitated with this crowd at semanticdesigns? http://stackoverflow.com/questions/526797/good-tools-for-creating-a-c-c-parser-analyzer, http://stackoverflow.com/questions/792454/most-effective-way-to-parse-c-like-definition-strings
tommieb75
I *am* the crowd at Semantic Designs. Check my bio here where this is clearly stated. Agreed, nice if you can afford it. The alternative (build the whole thing yourself) is nice if you can afford it, too, but you *can't*; neither you nor your employer can afford for you to spend the huge amount of time it takes to build such tools. And it makes even less sense if you intend to do it as a hobby, unless its a life-long task. A question of the form of "how do implement a simple compiler" wouldn't draw this response.
Ira Baxter
A: 

A few years - if you can get research grant to re-write new lex/yacc :-)

People keep chasing their tails on this a lot - starting with Stroustrup who was always fancied being a language "designer" rather than actual compiler writer (remember that his C++ was a mere codegen for ages andwould still be there if it wasn't for gcc and other folks).

The core issue is that real research on parser generators pretty much ceased to exist ever since CPU-s became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does exhaustive search till it nabs one "rule" that fires. Once you are content with that you kind of loose interest in researching how to do it efficiently.

What you'd essentially need is a reasonable middle-ground - like LALR(2) with fixed, limited backtraching (plus static checker to yell if "desiogner" splurges into a nondeterministic tree) and also limited and partitioned symbol table feedback (modern parser need to be concurrency-friendly).

Sounds like a research grant proposal, doesn't it :-) Now if we'd find someone to actually fund it, that would be something :-))

ZXX