views:

680

answers:

11

I came across this: Writing a compiler using Turbo Pascal

I am curious if there are any tutorials or references explaining how to go about creating a simple C compiler. I mean, it is enough if it gets me to the level of making it understand arithmetic operations. I became really curious after reading this article by Ken Thompson. The idea of writing something that understands itself seems exciting.

Why did I put up this question instead of asking Google? I tried Google and the Pascal one was the first link. The rest did no seem relevant and added to that... I am not a CS major (so I still need to learn what all those tools like yacc do) and I want to learn this by doing and am hoping people with more experience are always better at these things than Google. I want to read some article written in the same spirit as the one I listed above but that which highlights at least the bootstrapping phases of building a simple C compiler.

Also, I don't know the best way to learn. Do I start off building a C compiler in C or some other language? Do I write a C compiler or some other language? I feel questions like this are better answered once I have some direction to explore. Any suggestions?

Any suggestions?

+7  A: 

A compiler consists of three pieces:

  1. A parser
  2. A abstract syntax tree (AST)
  3. A code generator

There are lots of nice parser generators that start with language grammars. Maybe ANTLR would be a good place for you to start. If you want to stick to C roots, try lex/yacc or bison.

There are grammars for C, but I think C in its entirety is complex. You'd do well to start off with a subset of the language and work your way up.

Once you have an AST, you use it to generate the machine code that you'll run.

It's doable, but not trivial.

I'd also check Amazon for books about writing compilers. The Dragon Book is the classic, but there are more modern ones available.

UPDATE: There have been similar questions on Stack overflow, like this one. Check out those resources as well.

duffymo
Awesome! Thank you very much... That thread seems like a beast by itself.. Will start digging into it...
Legend
+1 for ANTLR. It might not be the best parser generator but the debugging and testing tools are hard to top.
Ron Warholic
+7  A: 

For what it's worth, the Tiny C Compiler is a pretty full-featured C compiler in a relatively small source package. You might benefit from studying that source, as it's probably significantly easier to understand than trying to comprehend all of GCC's source base, for instance.

Mark Rushakoff
+1 - very nice, indeed.
duffymo
I tried it, but it is very hard to read the source code
Phong
Will jump into it... +1 for this...
Legend
+7  A: 

I advise you this tutorial:

It is a small example on how to implement a "small language" compiler. The source code is very small and is explained step by step.

There is also the C front end library for the LLVM (Low Level Virtual Machine which represent the internal structure of a program) library:

Phong
+1 from me - I like this, too.
duffymo
@duffymo: thanks, What I really liked on this tutorial is that they dont rely on any external soft for the lexer/parser function.
Phong
Many thanks Phong... +1 from me too..
Legend
Yes, it's really terrific. Nice find. Thanks for posting it.
duffymo
+1  A: 

A compiler is a very large project, although I suppose it wouldn't hurt to try.

I know of at least one C compiler written in Pascal, so it's not the most insane thing you could do. I personally would pick a more modern language in which to implement my C compiler project, both for the simplicity (it's easy to d/l packages for Python, Ruby, C, C++ or Java) and because it will look better on your resume.

In order to do a compiler as a beginner project, though, you will need to drink all of the Agile kool-aid.

Always have something running, even if it doesn't do much of anything. Add things to your compiler only in small steps. ("Frequent releases".) Pick a viciously tiny subset of the language and implement that first. (Support only i = 0; at first and expand things from there.)

DigitalRoss
Sorry just trying to understand... Are you suggesting writing something using the higher level languages like Python or Ruby? If yes, sure, I am open to trying that as well.. As I mentioned, I am a complete n00b in this area so anything that makes me understand the main concept is welcome... Also could you clarify the resume part? :D
Legend
A non-trivial program in one of those modern languages will look good on your resume.It would be substantially easier to implement a compiler in something like Ruby or Python. Haskell would be by no means easy, but it might lead you to a much better compiler.
Andrew McGregor
Understood... I've only heard of Haskell... Never tried it though. Writing one in Python or Ruby never occurred to me. If that's the case, I'll definitely look into it. Thanks
Legend
Can we skip the _which language is better_ contest? As much as I personally enjoy Perl or Python, the OP said C was the desired language.
mctylr
Agreed with mctylr, but at least my background makes a functional language look like a much better "host" than Python. Compilers involve building structures quickly and carefully, and the "easy" languages get too hairy when it comes to things like identifier bindings and the underlying meaning of their builtin data structures. Also disagreeing on the Agile part. For a beginner project, this should be one person working one step at a time. *Do* save your test cases. *Don't* obsess over your plan.
Potatoswatter
@Potatoswatter, regarding Agile, I realize he is just one guy so the team parts of Agile won't apply. But I think it's very important that he incrementally prototype the program. He really needs to have something trivial running, then something slightly larger, then slightly larger. If he tries to really write a compiler and then start testing he will be totally doomed.
DigitalRoss
+3  A: 

A compiler is a complex subject matter that covers aspects of

  • Input processing involving Lexing, Parsing
  • Building a symbol store of every variable used such as an Abstract Syntax Tree (AST)
  • From the AST tree, transpose and build a machine code binary based on the syntax

This is by no means exhaustive as it is an abstract bird's eye view from the top of a mountain, it boils down to getting the syntax notation correct and ensuring that malformed inputs do not throw it off, in fact a good input processing should never fall on its knees no matter how malformed, terrible, abused cases of input that gets thrown at it. And, also in deciding and knowing what output is going to be, is it in machine code, which would imply you may have to get to know the processor instructions intimately...including memory addressing for variables and so on...

Here are some links for you to get started:

  • There was a Jack Crenshaw's port of his code for C....(I recall downloading it months ago...)
  • Here's a link to a similar question here on SO.
  • Also, here's another small compiler tutorial for Basic to x86 assembler compiler.
  • Tiny C Compiler
  • Hendrix's Small C Compiler found here.

Hope this helps, Best regards, Tom.

tommieb75
+1  A: 

It might be worthwhile to learn about functional programming, too. Functional languages are well-suited to writing a compiler both in and for. My school's intro compilers class contained an intro to functional languages and the assignments were all in OCaml.

Funny you should ask this today, since just a couple days ago I wrote a lambda calculus interpreter. Lambda calculus is the granddaddy of all functional languages. It's just 200 lines long (in C++, incl. error reporting, some pretty printing, some unicode) and has a two-phase structure, with an intermediate format that could be used to generate code.

Not only is starting small and building up the most practical approach to compilers, it also encourages good, modular, organizational practice.

Potatoswatter
+1 Will keep that in mind... Thanks.
Legend
@Legend: and by the way, don't listen to all the naysayers. You probably aren't going to make a breakthrough or even a competitive language, but neither would someone *with* a degree on their first attempt. So long as you're having fun, you'll always find something more to achieve.
Potatoswatter
So true... Sometimes I do get nervous looking at the 'real' CS grads... In fact, a recent post at slashdot: http://ask.slashdot.org/comments.pl?sid=10/02/19/147251 made me think more. But at the end of the day, I guess it all depends on how we think... :)
Legend
+1  A: 

This is my opinion (and conjecture) it will be hard to write a compiler without understanding data structures normally covered in undergraduate (post secondary) Computer Science classes. This doesn't mean you cannot, but you will need to know essential data structures such as linked lists, and trees.

Rather than writing a full or standards compliant C language compiler (at least in the start), I would suggest limiting yourself to a basic subset of the language, such as common operators, integer only support, and basic functions and pointers. One classic example of this was Ron Cain's Small-C, made popular by a series of articles written in Dr. Dobbs Journal in I believe the 1980s. They publish a CD with the James Hendrix's out-of-print book, A Small-C Compiler.

What I would suggest is following Crenshaw's tutorial, but write it for a C-like language compiler, and whatever CPU target (Crenshaw targets the Motorola 68000 CPU) you wish to target. In order to do this, you will need to know basic assembly of which ever target you want to run the compiled programs on. This could include a emulator for a 68000, or MIPS which are arguably nicer assembly instruction sets than the venerable CISC instruction set of the Intel x86 (16/32-bit).

There are many potential books that can be used as starting points for learning compiler / translator theory (and practice). Read the comp.compilers FAQ, and reviews at various online book sellers. Most introductory books are written as textbooks for sophomore to senior level undergraduate Computer Science classes, so they can be slow reading without a CS background. One older book that might be more introductory, but easier to read than "The Dragon Book" is Introduction to Compiler Construction by Thomas Parsons. It is older, so you should be able to find an used copy from your choice of online book sellers at a reasonable price.

So I'd say, try starting with Jack Crenshaw's Let's Build a Compiler tutorial, write your own, following his examples as a guide, and build the basics of a simple compiler. Once you have that working, you can better decide where you wish to take it from that point.

Added:

In regards to the bootstrapping process. Since there are existing C compilers freely available, you do not need to worry about bootstrapping. Write your compiler with separate, existing tools (GCC, Visual C++ Express, Mingw / djgpp, tcc), and you can worry about self-compiling your project at a much later stage. I was surprised by this part of the question until I realized you were brought to the idea of writing your own compiler by reading Ken Thomas' ACM Turing award speech, Reflections on Trusting Trust, which does go into the compiler bootstrapping process. It's a moderated advanced topic, and is also simply a lot of hassle as well. I find even bootstrapping the GCC C compiler under older Unix systems (Digital OSF/1 on the 64-bit Alpha) that included a C compiler a slow and time consuming, error prone process.

The other sort-of question was what a compiler tool like Yacc actually does. Yacc (Yet Another Compiler Compiler or Bison from GNU) is a tool designed to make writing a compiler (or translator) parser easily. Based on the formal grammar for your target language that you input to yacc, it generates a parser, which is one portion of a compiler's overall design. Next is Lex (or flex from GNU) which used to generate a lexical analyzer or scanner, which is often used in combination with the yacc generated parser to form the skeleton of the front-end of a compiler. These tools make writer a front end arguably easier than writing an lexical analyzer and parser yourself. Crenshaw's tutorial does not use these tools, and you don't need to either, many compiler writers don't always use them. Of course Crenshaw admits the tutorial's parser is quite basic.

Crenshaw's tutorial also skips generating an AST (abstract syntax tree), which simplifies but also limits the tutorial compiler. It lacks most if not all optimization, and is very tied to the specific programming language and the particular assembly language emitted by the "back-end" of the compiler. Normally the AST is a middle piece where some optimization can be performed, and serves to de-couple the compiler front-end and back-end in design. For a beginner without a Computer Science background, I'd suggest not worrying about not having an AST for your first compiler (or at least the first version of it). I think keeping it small and simple will help you finish writing a compiler, in its first version, and you can decide from there how you want to proceed then.

mctylr
It took me quite a while to digest what you've written. Very informative post. Thank you for your time...
Legend
Thank you, I am glad you found it useful. I hope I can answer your question in a manner that is helpful, and encourages you to be successful.
mctylr
+2  A: 

If you want a mind-blowing experience that teaches you how to write compilers that compile themselves, you need to read this paper from 1964.

META II a syntax-oriented compiler writing language by Val Schorre.

In 10 pages, it tells you how to write compilers, how to write meta compilers, provides a virtual metacompiler instruction set, and a sample compiler built with the metacompiler.

I learned how to write compilers from this paper back in the late 60s, and used the ideas to construct C-like langauges for several minicomputers and microprocessors.

If the paper is too much by itself (it isn't!) there's an online tutorial which will walk you through the whole thing.

And if getting the paper from the original link is awkward because you are not an ACM member, you'll find that the tutorial contains all the details anyway. (IMHO, for the price, the paper itself is waaaaay worth it).

10 pages!

Ira Baxter
Isn't the original article (Schorre's) behind ACM's pay-wall? If so, please note it. I happen to be an ACM member, but not everyone is.
mctylr
I'm just not sure if I can post a link here but I'm sure alternate versions are available through Google Scholar... Thanks a lot Ira Baxter.
Legend
+1  A: 

In The Unix Programming Environment, Kernighan and Pike walk through 5 iterations of making a calculator working from simple C based lexical analysis and immediate execution to yacc/lex parsing and code generation for an abstract machine. Because they write so wonderfully I can't suggest smoother introduction. It is certainly smaller than C, but that is likely to your advantage.

msw
Oh... I think I have that book with me somewhere..! Thanks
Legend
+2  A: 

How do I [start writing] a simple C compiler?

There's nothing simple about compiling C. The best simple C compiler is lcc by Chris Fraser and David Hanson. They spent 10 years working on the design to make it as simple as they possibly could, while still generating reasonably good code. If you have access to a university library, you should be able to get their book.

Do I start off building a C compiler in C or some other language?

Some other language. One time I got to ask Hanson what lessons he and Fraser had learned by spending 10 years on the lcc project. The main thing Hanson said was

C is a lousy language to write a compiler in.

You're better off using Haskell or some dialect of ML. Both languages offer functions over algebraic data types, which is a perfect match to the problems faced by the compiler writer. If you still want to pursue C, you could start with George Necula's CIL, which is a big chunk of a C compiler written in ML.

I want to read some article written in the same spirit as the one I listed above but that which highlights at least the bootstrapping phases...

You won't find another article like Ken's. But Andrew Appel has written a nice article called Axiomatic Bootstrapping: A Guide for Compiler Hackers I couldn't find a free version but many people have access to the ACM Digital Library.

Any suggestions?

If you want to write a compiler,

  • Use Haskell or ML as your implementation language.

  • For your first compiler, pick a very simple language like Oberon or like P0 from Niklaus Wirth's book Algorithms + Data Structures = Programs. Wirth is famous for designing languages that are easy to compile.

You can write a C compiler for your second compiler.

Norman Ramsey
I wouldn't call CIL a C compiler. CIL consists of a parser that reads C files and a writer that outputs equivalent C files. It does help with code analysis and transformation.
Wei Hu
I don't think I did call CIL a compiler. It's a front end. With C, that's a "big chunk", as I said.
Norman Ramsey
+2  A: 

You might be interested in the book/course The Elements of Computing Systems:Building a Modern Computer from First Principles.

Note that this isn't about building a "pc" from stuff you bought off newegg. It begins with a description of Boolean logic fundamentals, and builds a virtual computer from the lowest levels of abstraction to progressively higher levels of abstraction. The course materials are all online, and the book itself is fairly inexpensive from Amazon.

In the course, in addition to "building the hardware", you'll also implement an assembler, virtual machine, compiler, and rudimentary OS, in a step-wise fashion. I think this would give you enough of a background to delve deeper into the subject area with some of the more commonly recommended resources listed in the other answers.

Joe Internet
I know I did not really ask this question but you pretty much gave me a solution to a series of questions that I was about to ask in the future :) Looks like a really interesting book. Ordered it yesterday... just want to see how it feels to start from the roots... Thank again...
Legend