I want to write my own c++ compiler in c++. Let´s say I´m going to build it in VS. The main idea is that it must have been able to compile itself, which means I cannot use STL, boost, etc. libraries, can I? So, does it mean that I must do everything from scratch - from lexical analyser to binary code generation?
You'll have to bootstrap your compiler. Also, read the Dragon Book (mandatory reference).
which means I cannot use STL, boost, etc. libraries, can I?
Why not? They don't use any magic (as it were), only plain C++. So if your compiler knows all of C++, this shouldn't be a problem. Which is of course completely ludicrous, considering that C++ is probably the hardest to parse programming language in existence today, and that building a compiler from scratch would probably take hundreds (!) of man-years. (I didn't actually invent this number – I read it somewhere. Unfortunately, I'm unable to find the source now.)
But then, I guess the question was completely hypothetical so the hypothetical answer is: no, you can use existing libraries.
So why is C++ so hard to compile?
First off, C++ is a context sensitive language. Producing a parser for such a grammar is much harder. However, this is still one of the simplest aspects. What makes C++ really hard are certain rules relating to declarations/definitions, name lookup (consider argument-dependent name lookup) implicit conversion rules, and of course the resolution of templates. I think no existing compiler gets templates right in all circumstances.
Consider Andrej Alexandrescu's efford to write template-function versions of min
and max
that behave as correctly as the macro version. He claims that no compiler succeeds in compiling his (theoretically correct) code:
It would all be so nice, but there's a little detail worth mentioning. Sadly, Min doesn't work with any compiler I have access to. In fairness, each compiler chokes on a different piece of code.
I'd say you've got two basic options building a C++ compiler. First, build it in C, which would obviously mean it's not self-hosting but porting it to a different OS would be a lot easier as all you'll need is a C compiler on the target platform.
However if you want to build a C++ compiler in C++, I would think that building it using libraries like Boost or the Standard C++ library would be a good thing because you'll have to be able to build code using these libraries anyway. I would consider it a bonus if your compiler would use these libraries and is self hosting as you've got a pretty massive test case there.
The argument against using these libraries - well, at least some of boost - is that you'll potentially end up chasing obscure template problems when you should be working on adding functionality to the compiler. But at least you won't have to test your compiler with those parts of boost that your compiler uses.
Nevertheless, you'll have to implement a lot of functionality from scratch, but at least you'll be able to concentrate on writing the compiler instead of creating a dynamic array...
actually, our lecturer at the university wants us to write the compiler so that it can compile itself.
Does your compiler need to support all features of C++?
If it does need to support/compile every feature of C++, then that's a big project! It's too big to be homework, IMO, so there must be some important fact about this assignment that you haven't mentioned.
If it does not need to support/compile every feature of C++ (if instead it only needs to support a subset of C++, and in particular the subset which it uses in its own implementation, so that it can compile itself) then you should avoid using STL and Boost in your implementation: because the STL and Boost are implemented using templates, so if (and only if) you use them then you will have to include/implement support for compiling templates.
I'm impressed of your enthusiasm.
Have you thought about what file format you want to generate? object files? library files? executables? COFF/ELF/etc
You could look into making a simpler compiler that generates C code, ie like was created for C++ in the beginning. See CFront. This can be a way to initially focus on the parsing of C++.
Added
Sure that your lecturer meant C++? Maybe the task was to invent a easy language you could write the compiler in. I think ie Lisp is usually is used in academia for task like this.
This is an extremely difficult task. It includes:
- Selecting which C++ you wish to support. Stroustrup's ARM? ISO/IEC 14882:2003?
- Finding a correct grammar for your chosen C++ version.
- Writing a lexer/parser that can handle the contextual intricacies.
- Selecting a hardware platform for your code generator.
- Possibly writing a standards-compliant linker, so you can used 3rd party libraries.
It's taken gcc 22 years to get where it is. Visual C++ has taken 17 years (and there was an MS C++ compiler several years before VC++). These products are massive.
chester89, I don't wish to imply that you're not, but it takes a pro and a genius like Walter Bright to just go off and write your own C++ compiler (and then a D compiler).
I think it would help if you write other, simpler compilers first, if only to illustrate the enormity of this task.
You're going to need a lot of coffee.
These are high-quality answers.
If this is a toy project, I would carefully consider the scope.
If it only has to compile itself, there are lots of features you can leave out, like templates.
Can you use an off-the-shelf preprocessor?
Does it have to generate assembly code, or just C? C is much easier, obviously.
For parsing, you're gonna need LR1 (yacc or equivalent).
If you can bound the scope, the main challenges should be in the area of symbol-table management and type inference. Keep it simple (i.e. don't get wrapped up in hash-tables). And don't worry about optimizing.
Writing a compiler is a completely different game. I'm part of a development team that is developing a compiler for a Pascal-like language. And I must tell you that everything sounds very nice theoretically but if you do not know the entire specs beforehand and do not have a suitable design in place you'll get nowhere.
It's not that its impossible for a one-man army to develop a C++ compiler, it's just that it doesn't go all the way. I would recommend what Paul Beckingham said about starting things with a smaller compiler just to get the idea straight. Have a talk with your lecturer and convince him that you will take up and hopefully finish the compiler while you're at the university which in the former case would not be totally possible.
As many have said, it is very dubious that any one man team will get anywhere of significance with a from scratch c++ compiler.
That said, I can see a few different strategies for thinking about this.
- Focus on self-compile aspect. Start with a very tiny language kernel, something even smaller than regular c (forget c++ initially). Once it is able to compile itself, add features one at a time, until you've got c++ (or more likely give up due to the eventual heat-death of the universe.)
- Divide each portion of the system into tiny bits. For instance, a lexer can be easy if you use an off-the-shelf regex library for most of the work. It'll be slow, but you might get it pretty solid in just a few days.
- Many have said to start with a C++ to C translator, similar to Cfront. This is skipping ahead too much. start with a Verifier. This only checks that the program is well formed, and would compile. Adding the Analyzer, which translates to an intermediate abstract syntax tree can be added to a verifier with relative ease, and using an AST to produce machine code output (in C or assembly language or machine code) is also a small step.
- Although I can see the charm of not using libraries, because then you have to self compile too much, I'm not sure i'd make the same decision. For one thing, BOOST will increase your productivity. Boost has all kinds of useful tools to produce and manipulate the data-structures you need to build a compiler.
- One of the trickiest parts of getting a C++ compiler to work is proper template handling. The linker is supposed to remove duplicate template instantiations from different compilation units, so that only one of each instance is present. Duplicates of other code is an error. I'd probably just remove all duplicates, even if they are errors. Or something like this. Vary from the standard in ways that don't break much valid code anywhere that it can help.
Writing your own C++ compiler is hard. Lexing C++ is bad enough as the lexer needs to know what state the parser is in (nested template definitions). There is not yet a proven YACCable grammar for C++ even if given a correct lexer, so you might be in even worse trouble.
That said, I'm going to give you a few pointers.
- Write the C++ preprocesser first.
- Write the parser/lexer with a main program that pretty-prints the parse tree.
- Extend this program to build namespace tables and catch undefined symbols.
- Modify this program to spew out all expressions in postfix notation.
- Write the basic compiler from this guy, outputting some intermediate assembly for this guy. Ignore inline/volatile/const/register for now. You will need "call external" in your assembly.
- Write an interpreter for the intermediate assembly. See if the compiler works.
- Implement #file and #line in the compiler (have the preprocessor spew then as well). This will save you much pain on the next step.
- Add enough features to the compiler to be able to compile itself. Since you don't have modules yet, use the preprocessor (one file #includes all others)
- Archive all this.
- Begin using your compiler as its own bootstrap compiler (this will smoke out bugs really fast).
- Write the basic assembly->machine assembly backend. Pass this through an assembler to get object files you can link into executables.
- Archive all this
- Begin using your compiler with stock assembler to compile itself. This will smoke out even more bugs.
- Write the master binary that can call all the steps.
- Keep adding features until you have all the features you want/need.
Or give up on this project and consider G++. Even if you have a new CPU architecture/OS combination you will find it easier to port G++ to it than writing your own.