views:

542

answers:

8

I am interested in writing a very minimalistic compiler.

I want to write a small piece of software (in C/C++) that fulfills the following criteria:

  • output in ELF format (*nix)
  • input is a single textfile
  • C-like grammar and syntax
  • no linker
  • no preprocessor
  • very small (max. 1-2 KLOC)

Language features:

  • native data types: char, int and floats
  • arrays (for all native data types)
  • variables
  • control structures (if-else)
  • functions
  • loops (would be nice)
  • simple algebra (div, add, sub, mul, boolean expressions, bit-shift, etc.)
  • inline asm (for system calls)

Can anybody tell me how to start? I don't know what parts a compiler consists of (at least not in the sense that I just could start right off the shelf) and how to program them. Thank you for your ideas.

+4  A: 

These are the absolutely essential parts:

  • Scanner: This breaks the input file into tokens
  • Parser: This constructs an abstract syntax tree (AST) from the tokens identified by the scanner.
  • Code generation: This produces the output from the AST.

You'll also probably want:

  • Error handling: This tells the parser what to do if it encounters an unexpected token
  • Optimization: This will enable the compiler to produce more efficient machine code

Edit: Have you already designed the language? If not, you'll want to look into language design, too.

Bruce Alderman
'look into language design': Do you mean a specific resource or paradigm? Or just something I need to revolve in my head?
prinzdezibel
You'll have to create a language grammar that is compatible with the type of parser you want to use. I'd take a look at top-down vs bottom-up parsers to get started.
Bruce Alderman
+1  A: 

The number one essential is a book on compiler writing. A lot of people will tell you to read the "Dragon Book" by Aho et al, but the best book I've read on compilers is "Brinch Hansen on Pascal Compilers". I suspect it's out of print (Amazon is your friend), but it takes you through all the steps of designing and writing a compiler using recursive descent, which is the easiest method for compiler newbies to understand.

Although the book uses Pascal as the implementation and target languages, the lessons and techniques presented apply equally to all other languages.

anon
+1 for Brinch Hansen. It strikes the best balance between technical and practical information on compiler design.
Bruce Alderman
+2  A: 

I don't know what you hope to get out of this, but if it is learning, and looking at existing code works for you, there is always tcc.

dmckee
+3  A: 

Firstly, you need to decide whether you are going to make a compiler or an interpreter. A compiler translates your code into something that can be run either directly on hardware, in an interpreter, or get compiled into another language which then is interpreted in some way. Both types of languages are turing complete so they have the same expressive capabilities. I would suggest that you create a compiler which compiles your code into either .net or Java bytecode, as it gives you a very optimized interpreter to run on as well as a lot of standard libraries.

Once you made your decision there are some common steps to follow

  1. Language definition Firstly, you have to define how your language should look syntactically.

  2. Lexer The second step is to create the keywords of your code, known as tokens. Here, we are talking about very basic elements such as numbers, addition sign, and strings.

  3. Parsing The next step is to create a grammar that matches your list of tokens. You can define your grammar using e.g. a context-free grammar. A number of tools can be fed with one of these grammars and create the parser for you. Usually, the parsed tokens are organized into a parse tree. A parse tree is the representation of your grammar as a data structure which you can move around in.

  4. Compiling or Interpreting The last step is to run some logic on your parse tree. A simple way to make your own interpreter is to create some logic associated to each node type in your tree and walk through the tree either bottom-up or top-down. If you want to compile to another language you can insert the logic of how to translate the code in the nodes instead.

Wikipedia is great for learning more, you might want to start here.

Concerning real-world reading material I would suggest "Programming language processors in JAVA" by David A Watt & Deryck F Brown. I used that book in my compilers course and learning by example is great in this field.

Per Stilling
+1  A: 

The examples are all in Perl, but Exploring Programming Language Architecture in Perl is a good book (and free).

daotoad
+5  A: 

With all that you hope to accomplish, the most challenging requirement might be "very small (max. 1-2 KLOC)". I think your first requirement alone (generating ELF output) might take well over a thousand lines of code by itself.

One way to simplify the problem, at least to start with, is to generate code in assembly language text that you then feed into an existing assembler (nasm would be a good choice). The assembler would take care of generating the actual machine code, as well as all the ELF specific code required to build an actual runnable executable. Then your job is reduced to language parsing and assembly code generation. When your project matures to the point where you want to remove the dependency on an assembler, you can rewrite this part yourself and plug it in at any time.

If I were you, I might start with an assembler and build pieces on top of it. The simplest "compiler" might take a language with just a few very simple possible statements:

print "hello"
a = 5
print a

and translate that to assembly language. Once you get that working, then you can build a lexer and parser and abstract syntax tree and code generator, which are most of the parts you'll need for a modern block structured language.

Good luck!

Greg Hewgill
Even easier, have it generate C as its output. Lots of successful compilers have gone this route.
anon
Note that NASM is written in C, so you might be able to use code from NASM in your translation to machine code.
Chris Lutz
A: 

I always recommend flex and bison for this kind of work as a beginner. You can always learn the ins and outs of writing your own scanner and parser later, although they may increase the code size at least they will be generated for you by tools. :)

jheriko
+1  A: 

A really good set of free references, IMHO, are:

Overall compiler tutorial: Let's Build a Compiler by Jack Crenshaw (http://compilers.iecc.com/crenshaw/) It's wordy, but I like it.

Assembler: NASM (nasm.us) good for Linux and Windows/DOS, and most importantly lots of doco and examples/tutorials. (FASM is also good but less documentation/tutorials out there)

Other sources The PC Assembly book (http://www.drpaulcarter.com/pcasm/index.php)

I'm trying to write a LISP, so I'm using the Lisp 1.5 Manual. You may want to get the language spec for whatever language you're writing.

As far as 1-2KLOC, assuming you use a high level language (like Py or Rb) you should be close if you're not too ambitious.

Since he wants to write it in C/C++ (whatever that means), I would go with NASM. FASM is good, but is written in assembly, whereas NASM is written in C. NASM may provide more useful code.
Chris Lutz