views:

2639

answers:

10

I want to write my own c++ compiler in c++. Let´s say I´m going to build it in VS. The main idea is that it must have been able to compile itself, which means I cannot use STL, boost, etc. libraries, can I? So, does it mean that I must do everything from scratch - from lexical analyser to binary code generation?

+7  A: 

You'll have to bootstrap your compiler. Also, read the Dragon Book (mandatory reference).

eljenso
No, you don't have to bootstrap the compiler since the system already offers a fully-functional compiler. Bootstrapping works around the *lack* of a fully-functional compiler.
Konrad Rudolph
Maybe I didn't understand the question, but he wants to compile his compiler using his own compiler. So he uses an existing C++ compiler to bootstrap his own? Right?
eljenso
Yes. Sorry for the confusion. What's not required though is several iterations. Theoretically, one iteration suffices. Of course, that's quite unlikely as you noticed – but so is writing a C++ compiler in the first place.
Konrad Rudolph
+30  A: 

which means I cannot use STL, boost, etc. libraries, can I?

Why not? They don't use any magic (as it were), only plain C++. So if your compiler knows all of C++, this shouldn't be a problem. Which is of course completely ludicrous, considering that C++ is probably the hardest to parse programming language in existence today, and that building a compiler from scratch would probably take hundreds (!) of man-years. (I didn't actually invent this number – I read it somewhere. Unfortunately, I'm unable to find the source now.)

But then, I guess the question was completely hypothetical so the hypothetical answer is: no, you can use existing libraries.

So why is C++ so hard to compile?

First off, C++ is a context sensitive language. Producing a parser for such a grammar is much harder. However, this is still one of the simplest aspects. What makes C++ really hard are certain rules relating to declarations/definitions, name lookup (consider argument-dependent name lookup) implicit conversion rules, and of course the resolution of templates. I think no existing compiler gets templates right in all circumstances.

Consider Andrej Alexandrescu's efford to write template-function versions of min and max that behave as correctly as the macro version. He claims that no compiler succeeds in compiling his (theoretically correct) code:

It would all be so nice, but there's a little detail worth mentioning. Sadly, Min doesn't work with any compiler I have access to. In fairness, each compiler chokes on a different piece of code.

Konrad Rudolph
actually, our lecturer at the university wants us to write the compiler so that it can compile itself. will it be possible if I use boost, for example?
chester89
Your lecturer might just want you to find out the hard way how difficult this is – or (s)he wants you to implement only a severly limited subset (in that case, yes, STL and Boost are out). In any case, check with the specs carefully!
Konrad Rudolph
@chester89: No. You can use any feature which your compiler will support... Write your compiler in c++, compiler with the system compiler, to get a "dirty" build. Use the dirty build to generate a clean build, use the clean build to generate a test build. If the last two are the same, you win.
dmckee
I don't see why C++ is any harder to parse than several other languages. What's your basis for that assertion?
SoapBox
As I recall, C++ is actually a context-sensitive grammar. Not a context-free grammar. That makes it massively more complex.
Paul Nathan
SoapBox - templates! It's hard to get it right, because templates are one of the hardest things... Especially, template arguments for templates etc...
Paulius Maruška
@SoapBox, Paul, Paulius: I've updated my answer appropriately. Thanks for the comments.
Konrad Rudolph
Unless I don't understand the question, he wants to compile his compiler using his own compiler. That's why, unless he supports the entire C++ language from the first iteration (*quite* unlikely), he can't use STL and other stuff.
eljenso
dammit overload resolution in C++ is so freaking complex :) +1
Johannes Schaub - litb
i know of a c++ parser that's written in haskell. it parses the language, but purposely leaves out some stuff so it still makes sense for its use-case. it's educational to look into its code :) http://www.xs4all.nl/~weegen/eelis/geordi/
Johannes Schaub - litb
but comeau targets the c++ language with 100% compatibility i think. of course, it contains also bugs. but they are not known. and if, they want to fix them asap hehe :)
Johannes Schaub - litb
Litb: Geordi actually uses GCC under the hood. Haskell is just used to create the sandbox and for some environment modifications, settings, execution, output etc.
Konrad Rudolph
Konrad, nono look it has a haskell parser. so you can write "remove declaration of F", or "make F a void()" and it transform the code. without help of GCC. of course it uses GCC to compile code, but it has its own C++ parser that does stuff. i like it :)
Johannes Schaub - litb
C++ is hard to parse. It's a fact. As mentioned, it's due to how context-sensitive the language is. There's a reason it took some years for a decent C++ compiler to emerge.
fow
A: 

I'd say you've got two basic options building a C++ compiler. First, build it in C, which would obviously mean it's not self-hosting but porting it to a different OS would be a lot easier as all you'll need is a C compiler on the target platform.

However if you want to build a C++ compiler in C++, I would think that building it using libraries like Boost or the Standard C++ library would be a good thing because you'll have to be able to build code using these libraries anyway. I would consider it a bonus if your compiler would use these libraries and is self hosting as you've got a pretty massive test case there.

The argument against using these libraries - well, at least some of boost - is that you'll potentially end up chasing obscure template problems when you should be working on adding functionality to the compiler. But at least you won't have to test your compiler with those parts of boost that your compiler uses.

Nevertheless, you'll have to implement a lot of functionality from scratch, but at least you'll be able to concentrate on writing the compiler instead of creating a dynamic array...

Timo Geusch
+4  A: 

actually, our lecturer at the university wants us to write the compiler so that it can compile itself.

Does your compiler need to support all features of C++?

  • If it does need to support/compile every feature of C++, then that's a big project! It's too big to be homework, IMO, so there must be some important fact about this assignment that you haven't mentioned.

  • If it does not need to support/compile every feature of C++ (if instead it only needs to support a subset of C++, and in particular the subset which it uses in its own implementation, so that it can compile itself) then you should avoid using STL and Boost in your implementation: because the STL and Boost are implemented using templates, so if (and only if) you use them then you will have to include/implement support for compiling templates.

ChrisW
A: 

I'm impressed of your enthusiasm.

Have you thought about what file format you want to generate? object files? library files? executables? COFF/ELF/etc

You could look into making a simpler compiler that generates C code, ie like was created for C++ in the beginning. See CFront. This can be a way to initially focus on the parsing of C++.

Added

Sure that your lecturer meant C++? Maybe the task was to invent a easy language you could write the compiler in. I think ie Lisp is usually is used in academia for task like this.

epatel
CFront was a plausible approach back then. C++ is a hell of a lot more complex today, so that approach would be just as difficult today as a "traditional" compiler.
jalf
You think? If you can leave out CPU things like registers, ABI, etc, file formats for objectfiles/libraries/executables. Just focus on the language...but, I can't imagine that creating a full C++ compiler is really what the lecturer meant for a class assignment
epatel
+9  A: 

This is an extremely difficult task. It includes:

  • Selecting which C++ you wish to support. Stroustrup's ARM? ISO/IEC 14882:2003?
  • Finding a correct grammar for your chosen C++ version.
  • Writing a lexer/parser that can handle the contextual intricacies.
  • Selecting a hardware platform for your code generator.
  • Possibly writing a standards-compliant linker, so you can used 3rd party libraries.

It's taken gcc 22 years to get where it is. Visual C++ has taken 17 years (and there was an MS C++ compiler several years before VC++). These products are massive.

chester89, I don't wish to imply that you're not, but it takes a pro and a genius like Walter Bright to just go off and write your own C++ compiler (and then a D compiler).

I think it would help if you write other, simpler compilers first, if only to illustrate the enormity of this task.

You're going to need a lot of coffee.

Paul Beckingham
Deciding on how to build the intermediate levels of the compiler is also a challenging task. Starting off on smaller compilers should be sweet.
Elroy
thanks for comment, I also think that university man didn`t mean us to write Visual Studio from scratch - he used to be a programmer, but now he`s not, and what he does in his lectures and practical classes is telling us how programming dissapointed him, philosophizing about life, etc.:)
chester89
A: 

These are high-quality answers.

If this is a toy project, I would carefully consider the scope.

If it only has to compile itself, there are lots of features you can leave out, like templates.

Can you use an off-the-shelf preprocessor?

Does it have to generate assembly code, or just C? C is much easier, obviously.

For parsing, you're gonna need LR1 (yacc or equivalent).

If you can bound the scope, the main challenges should be in the area of symbol-table management and type inference. Keep it simple (i.e. don't get wrapped up in hash-tables). And don't worry about optimizing.

Mike Dunlavey
+1  A: 

Writing a compiler is a completely different game. I'm part of a development team that is developing a compiler for a Pascal-like language. And I must tell you that everything sounds very nice theoretically but if you do not know the entire specs beforehand and do not have a suitable design in place you'll get nowhere.

It's not that its impossible for a one-man army to develop a C++ compiler, it's just that it doesn't go all the way. I would recommend what Paul Beckingham said about starting things with a smaller compiler just to get the idea straight. Have a talk with your lecturer and convince him that you will take up and hopefully finish the compiler while you're at the university which in the former case would not be totally possible.

Elroy
+3  A: 

As many have said, it is very dubious that any one man team will get anywhere of significance with a from scratch c++ compiler.

That said, I can see a few different strategies for thinking about this.

  1. Focus on self-compile aspect. Start with a very tiny language kernel, something even smaller than regular c (forget c++ initially). Once it is able to compile itself, add features one at a time, until you've got c++ (or more likely give up due to the eventual heat-death of the universe.)
  2. Divide each portion of the system into tiny bits. For instance, a lexer can be easy if you use an off-the-shelf regex library for most of the work. It'll be slow, but you might get it pretty solid in just a few days.
  3. Many have said to start with a C++ to C translator, similar to Cfront. This is skipping ahead too much. start with a Verifier. This only checks that the program is well formed, and would compile. Adding the Analyzer, which translates to an intermediate abstract syntax tree can be added to a verifier with relative ease, and using an AST to produce machine code output (in C or assembly language or machine code) is also a small step.
  4. Although I can see the charm of not using libraries, because then you have to self compile too much, I'm not sure i'd make the same decision. For one thing, BOOST will increase your productivity. Boost has all kinds of useful tools to produce and manipulate the data-structures you need to build a compiler.
  5. One of the trickiest parts of getting a C++ compiler to work is proper template handling. The linker is supposed to remove duplicate template instantiations from different compilation units, so that only one of each instance is present. Duplicates of other code is an error. I'd probably just remove all duplicates, even if they are errors. Or something like this. Vary from the standard in ways that don't break much valid code anywhere that it can help.
TokenMacGuy
+2  A: 

Writing your own C++ compiler is hard. Lexing C++ is bad enough as the lexer needs to know what state the parser is in (nested template definitions). There is not yet a proven YACCable grammar for C++ even if given a correct lexer, so you might be in even worse trouble.

That said, I'm going to give you a few pointers.

  • Write the C++ preprocesser first.
  • Write the parser/lexer with a main program that pretty-prints the parse tree.
  • Extend this program to build namespace tables and catch undefined symbols.
  • Modify this program to spew out all expressions in postfix notation.
  • Write the basic compiler from this guy, outputting some intermediate assembly for this guy. Ignore inline/volatile/const/register for now. You will need "call external" in your assembly.
  • Write an interpreter for the intermediate assembly. See if the compiler works.
  • Implement #file and #line in the compiler (have the preprocessor spew then as well). This will save you much pain on the next step.
  • Add enough features to the compiler to be able to compile itself. Since you don't have modules yet, use the preprocessor (one file #includes all others)
  • Archive all this.
  • Begin using your compiler as its own bootstrap compiler (this will smoke out bugs really fast).
  • Write the basic assembly->machine assembly backend. Pass this through an assembler to get object files you can link into executables.
  • Archive all this
  • Begin using your compiler with stock assembler to compile itself. This will smoke out even more bugs.
  • Write the master binary that can call all the steps.
  • Keep adding features until you have all the features you want/need.

Or give up on this project and consider G++. Even if you have a new CPU architecture/OS combination you will find it easier to port G++ to it than writing your own.

Joshua