I've been programming for years (mainly Python), but I don't understand what happens behind the scenes when I compile or execute my code.

In the vein of a question I asked earlier about operating systems, I am looking for a gentle introduction to programming language engineering. I want to be able to define and understand the basics of terms like compiler, interpreter, native code, managed code, virtual machine, and so on. What would be a fun and interactive way to learn about this?

+3  A: 

This site has a great series of lectures on the Structure and Interpretation of Computer Programs, which is exactly the type of thing you want to learn. The accompanying textbook is useful too, though I haven't personally read through the whole thing. I think watching the lectures is pretty good and gets you about 60% of the way there.

Chii
A: 

You can find many lectures, for example at iTunes U.

SomeUser
+1  A: 

Whoa, this is a huge question, with tons of books written about it. I really doubt you will get a decent answer on SO. You need to get to your local bookstore or take a few comp sci classes.

To give you a quick intro:

  • Compiler: A program that converts written code into instructions that are natively understood by the processor.
  • Interpreter: A program that reads written code and executes it on the fly, without first producing a separate file of processor-native instructions.
  • Managed code: Code that runs in a virtual machine, e.g. to give cross-platform compatibility (Java).
  • Virtual machine: A program that emulates the behavior, or rather the API, of a full-blown computer environment. Among other things, this gives some security advantages and cross-platform compatibility.
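
Since you know Python, you can poke at these terms from the standard library. This is only a minimal sketch, assuming CPython; the function name add_scaled is made up for illustration. compile turns source text into bytecode for Python's virtual machine, exec has the virtual machine run it, and dis prints the instructions.

```python
import dis

source = "def add_scaled(a, b, c):\n    return a + b * c\n"

# Compile step: source text -> code object holding bytecode for the Python VM.
code_object = compile(source, "<example>", "exec")

# Execute step: the virtual machine runs the bytecode, which defines the function.
namespace = {}
exec(code_object, namespace)
add_scaled = namespace["add_scaled"]

# Show the VM instructions for the function (exact opcodes vary by Python version).
dis.dis(add_scaled)

print(add_scaled(6, 2, 3))  # evaluates 6 + 2 * 3
```

The listing that dis prints is bytecode for the virtual machine, not native code for your processor.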
Henrik Paul
+2  A: 

http://en.wikipedia.org/wiki/Dragon_Book_(computer_science) will explain a lot of those concepts. You should give it a read; it was a real eye-opener for me.

Blindy
+4  A: 

Compilers, interpreters and virtual machines are just examples of implementation details. What you might look for is programming language theory, generative grammar, and language translators, and you possibly need some computer architecture to relate the theory to implementations.

Personally, I learned from Sebesta's book. It gives a very wide introduction to the subject without going into minute details. It also has a good chapter on the history of programming languages (~20 languages ~3 papers per language). It has a nice explanation of grammars and the theory of languages in general. It also gives a good introduction to Scheme, Prolog, and programming paradigms (Logic, Functional, Imperative^, Object oriented).

^ It concentrates a lot more on the imperative paradigm than on the first two.

AraK
+8  A: 

Code to execution in a nutshell

A program (code) is fed into the compiler (or interpreter).

Characters are grouped into tokens (+, identifiers, numbers), and information about them, such as identifier names, is stored in something called a symbol table.

These tokens are put together to form statements (for example, int a = 6 + b * c;), mostly in the form of a syntax tree:
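
The tokenizing step can be sketched in a few lines of Python. This is an illustration only, not how any particular compiler does it: the token categories, the KEYWORDS set, and the shape of the symbol table are all assumptions made for this example.

```python
import re

# Hypothetical token categories for this sketch; real languages have many more.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[=+\-*/;]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
KEYWORDS = {"int"}


def tokenize(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        kind, value = match.lastgroup, match.group()
        if kind == "IDENT" and value in KEYWORDS:
            kind = "KEYWORD"
        if kind != "SKIP":
            tokens.append((kind, value))
        pos = match.end()
    return tokens


tokens = tokenize("int a = 6 + b * c;")

# A toy symbol table: each identifier is recorded once, in order of first use.
symbol_table = {}
for kind, value in tokens:
    if kind == "IDENT" and value not in symbol_table:
        symbol_table[value] = len(symbol_table)

print(tokens)
print(symbol_table)
```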

                     =
                    / \
                   /   \ 
                  a     +
                       / \
                      /   \
                     6     *
                          / \
                         b   c

Within an interpreter the tree is executed directly.
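
"Executed directly" can be as simple as a recursive walk over the tree. Here is a rough Python sketch, assuming the tree is stored as nested tuples and that b and c already have values; both choices are mine, for illustration only.

```python
import operator

# The right-hand side of the tree above, 6 + b * c, as nested tuples:
# (operator, left, right) for inner nodes, plain numbers or names for leaves.
tree = ("+", 6, ("*", "b", "c"))

OPERATIONS = {"+": operator.add, "*": operator.mul}


def evaluate(node, variables):
    """Walk the tree recursively and compute its value."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPERATIONS[op](evaluate(left, variables), evaluate(right, variables))
    if isinstance(node, str):
        return variables[node]  # leaf: variable lookup
    return node  # leaf: numeric literal


variables = {"b": 2, "c": 3}
variables["a"] = evaluate(tree, variables)  # the "=" node: store the result in a
print(variables["a"])
```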

With a compiler, the tree is finally translated into either intermediate code or assembler code.
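
As a rough sketch of that translation step, here is the same kind of tree turned into instructions for an imaginary stack machine. The opcode names (PUSH, LOAD, ADD, MUL, STORE) and the tuple encoding of the tree are invented for this example; real intermediate code and assembler look different.

```python
# The expression 6 + b * c as nested tuples: (operator, left, right).
tree = ("+", 6, ("*", "b", "c"))

OPCODES = {"+": "ADD", "*": "MUL"}


def generate(node, code):
    """Post-order walk: emit code for both operands, then the operator."""
    if isinstance(node, tuple):
        op, left, right = node
        generate(left, code)
        generate(right, code)
        code.append((OPCODES[op], None))
    elif isinstance(node, str):
        code.append(("LOAD", node))  # push the value of a variable
    else:
        code.append(("PUSH", node))  # push a literal
    return code


instructions = generate(tree, [])
instructions.append(("STORE", "a"))  # the "=" node: pop the result into a

for opcode, argument in instructions:
    print(opcode if argument is None else f"{opcode} {argument}")
```

Emitting the operands before the operator is what makes the order of operations in the tree come out right on a stack machine.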

You now have one or more "object files". These contain the assembler code without the precise jumps (because these values are not known yet, especially if the targets are in other object files). The object files are linked together with a linker, which fills in the blanks for the jumps (and references). The output of the linker is a library (which can be linked too) or an executable file.

If you start the executable, the program data is copied into memory and there is some other link juggling to match the pointers with the correct memory locations. And then control is given to the first instruction.
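
To make "filling in the blanks" concrete, here is a toy linker in Python. The object-file layout (code, defines, fixups) and the instruction names are made up for this sketch and are far simpler than real formats such as ELF or COFF.

```python
# Hypothetical object-file layout for this sketch:
#   code:    list of instructions; a jump target not yet known is None
#   defines: symbol name -> offset of that symbol within this file's code
#   fixups:  index of an instruction with a blank target -> symbol it needs
main_obj = {
    "code": [["CALL", None], ["HALT"]],
    "defines": {"main": 0},
    "fixups": {0: "helper"},
}
helper_obj = {
    "code": [["PRINT"], ["RET"]],
    "defines": {"helper": 0},
    "fixups": {},
}


def link(object_files):
    """Concatenate the code, then fill in every blank with a final address."""
    image, symbols, pending = [], {}, []
    for obj in object_files:
        base = len(image)  # where this file's code lands in the output
        for name, offset in obj["defines"].items():
            symbols[name] = base + offset
        for index, name in obj["fixups"].items():
            pending.append((base + index, name))
        image.extend(list(instruction) for instruction in obj["code"])
    for address, name in pending:
        if name not in symbols:
            raise NameError(f"undefined symbol: {name}")
        image[address][1] = symbols[name]  # patch the blank jump target
    return image, symbols


executable, symbols = link([main_obj, helper_obj])
print(executable)
print(symbols)
```

The blank in main's CALL can only be filled once the linker knows where helper ends up in the combined output, which is why the patching happens in a second pass.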

Gamecat
A: 

When I learned about programming, somewhere in the second half of the previous century, I learned that everything needs to be converted to machine code. Script languages would just decide which code to call based upon the scripted code. Compiled code would first be compiled to p-code, which stands for pre-compiled code, which needs to be linked to other precompiled code to create a full application. I liked Turbo Pascal back then, simply because Turbo Pascal compiled directly to machine code and didn't use the intermediate p-code. That is, until Turbo Pascal 4.0, which created *.tpu compiled units. Most other compilers would compile to the .obj format instead.

When Java was created, something relatively new started to become popular. Basically, a Java compiler just compiles code to some binary script file. This script could then be interpreted, although that mechanism soon changed too.

Nowadays, interpreters are nearly extinct. Most scripted languages will first be compiled to machine code; the machine code is then stored in some cache so it can be executed real fast, without the system having to re-interpret any repeating instructions. This works well for text and binary scripts. PHP would be an example of a text-based script. Java and .NET are binary scripts, since you generally compile the code to this binary script format. (They call it something different, but I think "binary scripts" sounds better.)

In general, the trick is to convert the code to machine code, using whatever means possible. There have been many ways to do so and it's a bit complex to make it all clear.

I also remember the time when I could write a C++ application where SQL statements would be located inside the code itself. This was very practical too, but it required a preprocessor which would first extract the SQL statements from the code and replace them with more complex C++ statements. Then the whole thing would be compiled to p-code. Then you'd need to link it with the additional SQL libraries and finally you had an executable.

Workshop Alex
+2  A: 

In basic terms, you write source files. These are fancy text files that are taken in by the compiler, which outputs some form of executable code (what executes it depends on the type of code you're talking about). The compiler has several parts:

  • Some form of preprocessing on the file, which handles macros and the like (as in C).
  • A parser, which takes in source files, verifies that they conform to the syntactic rules of your language, and transforms the file into an in-memory data structure that is more easily manipulable by other parts of the program. This is called an Abstract Syntax Tree or AST.
  • Some form of AST analysis, which verifies that the actual code you wrote does not violate any rules of the language (e.g. recursion in a language that does not support it), as well as many other things.
  • Optimization, such as tail call optimization, loop optimization, and many other kinds.
  • Code generation, which is the actual process of taking the final AST and any other generated data and turning it into a binary file of some sort that can be executed or interpreted.
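
Python lets you watch most of these stages through its ast module. The following is only a sketch, assuming Python 3.9 or later (for ast.unparse); the FoldConstants class and the names_read check are written for this example and stand in for the much larger analysis and optimization passes of a real compiler. There is no preprocessing stage here, since Python has no macros.

```python
import ast

source = "result = 2 * 3 + x"

# Parsing: source text -> abstract syntax tree.
tree = ast.parse(source)

# Analysis: a toy check that collects every variable the code reads.
names_read = {
    node.id
    for node in ast.walk(tree)
    if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
}


class FoldConstants(ast.NodeTransformer):
    """Optimization: replace `constant op constant` with its computed value."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            if isinstance(node.op, ast.Add):
                value = node.left.value + node.right.value
            elif isinstance(node.op, ast.Mult):
                value = node.left.value * node.right.value
            else:
                return node
            return ast.copy_location(ast.Constant(value=value), node)
        return node


optimized = ast.fix_missing_locations(FoldConstants().visit(tree))

# Code generation: AST -> bytecode for the Python virtual machine.
code_object = compile(optimized, "<example>", "exec")

namespace = {"x": 10}
exec(code_object, namespace)
print(names_read)
print(ast.unparse(optimized))
print(namespace["result"])
```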

Interpreter:

An interpreter is a program that takes in some form of binary data that represents a program not compiled to code directly executable by the target machine, and runs the commands within. Examples are Python, Java, and Lua.

Native code:

This is code that has been compiled into native instructions directly executable by the target machine. For instance, if you run on an x86 architecture, then C++ will compile to an executable file that is understandable by the processor.

Virtual Machine:

This is generally a program built to simulate the construction and operation of a processor. It may be as simple as a program that reads in bytecode and runs native language operations based on the commands the bytecode represents (though calling this a virtual machine may be a stretch), or it may be as complex as completely simulating the behavior of a processor and all associated peripherals.
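
The simple end of that range fits in a few lines. Below is a sketch of a stack-based virtual machine in Python; the instruction set (PUSH, LOAD, STORE, ADD, MUL) is invented for this example and is not the bytecode of any real VM.

```python
def run(bytecode, variables):
    """Fetch-decode-execute loop over (opcode, argument) pairs."""
    stack = []
    pc = 0  # program counter: index of the next instruction
    while pc < len(bytecode):
        opcode, argument = bytecode[pc]
        pc += 1
        if opcode == "PUSH":
            stack.append(argument)
        elif opcode == "LOAD":
            stack.append(variables[argument])
        elif opcode == "STORE":
            variables[argument] = stack.pop()
        elif opcode in ("ADD", "MUL"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if opcode == "ADD" else left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return variables


# Bytecode for: a = 6 + b * c
program = [
    ("PUSH", 6),
    ("LOAD", "b"),
    ("LOAD", "c"),
    ("MUL", None),
    ("ADD", None),
    ("STORE", "a"),
]

result = run(program, {"b": 2, "c": 3})
print(result["a"])
```

Each branch of the if chain is an ordinary native-language operation chosen by the bytecode, which is the "simple" kind of virtual machine described above.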

Those other answers have good points in them, but this info and links ought to get you started. Any other questions, just ask!

(Most of this article was written with the help of Wikipedia, though some was written from memory.)

RCIX
+1  A: 

This series of lectures from Stanford covers several programming languages down to the bits and bolts, including Python (though I've only watched a couple of the C ones).

Pete Kirkham
+1  A: 

If you want to know how one goes from source code to something that actually runs on a target machine, you should get a copy of the famous Red Dragon Book. I've used it for building parsers and lexical analyzers. While it dates back to 1986, and I'm sure there's been progress in the interim, as far as I can tell, it hasn't been surpassed as a text.

It appears that Addison-Wesley has done a reprint of its predecessor, the Green Dragon Book, and is passing it off as something recent, so be careful to get the genuine article.

Bob Murphy