I'm having trouble understanding how compilers and linkers work and the files they create. More specifically, how do .cpp, .h, .lib, .dll, .o, .exe all work together? I am mostly interested in C++, but was also wondering about Java and C#. Any books/links would be appreciated!
For an operating-system-agnostic and language-agnostic explanation (though still a bit POSIXy), try:
Tanenbaum - Modern Operating Systems 3rd Edition.
It covers all that.
There are surprisingly few books on this topic. Here are some thoughts:
Do not bother with the Dragon Book unless you are actually writing a compiler using a table-driven approach. It is a very hard read and does not cover the simplest approach to parsing, recursive descent, in any detail. Caveat: I haven't read the latest edition.
If you actually want to write a compiler, take a look at "Brinch Hansen on Pascal Compilers", which is an easy read and provides the full source for a small Pascal compiler. Don't let the Pascal stuff put you off: the lessons it teaches are applicable to all compiled languages.
When it comes to linking, there are very few resources. The best book I've read on the subject is Linkers & Loaders.
What I say below is only approximate, but captures what I believe are some essential things you need to know.
In C++ the phases of compilation are (1) preprocessing, (2) actual compilation, and (3) linking.
The preprocessing phase takes as input a cpp file and does textual substitutions guided by directives like "#include" and "#define". In particular, the content of h files is copied verbatim in the place of "#include" directives.
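To make the "textual substitution" point concrete, here is a minimal sketch. The macro and function names (BUFFER_SIZE, SQUARE_NAIVE, SQUARE_SAFE, naive_result, safe_result) are made up for illustration. The comments show the text the compiler sees once the preprocessor has finished.

```cpp
// Hypothetical macros, chosen only to show that the preprocessor works on text.
#define BUFFER_SIZE 16
#define SQUARE_NAIVE(x) x * x        // expands to the raw text "x * x"
#define SQUARE_SAFE(x) ((x) * (x))   // the parentheses survive the substitution

// After preprocessing this line is literally: int buffer[16];
int buffer[BUFFER_SIZE];

int naive_result() {
    // Text after substitution: 1 + 2 * 1 + 2, which evaluates to 5, not 9.
    return SQUARE_NAIVE(1 + 2);
}

int safe_result() {
    // Text after substitution: ((1 + 2) * (1 + 2)), which evaluates to 9.
    return SQUARE_SAFE(1 + 2);
}
```

If you use GCC or Clang, the -E option stops after preprocessing and prints the resulting text, which is a handy way to see what an #include or #define really did.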
The actual compilation produces machine code that lives in o files. Most instructions that appear in o files are instructions that the processor knows about, with the exception of *call function_name*. The processor doesn't know about names, it only knows about addresses.
In the (static) linking phase, multiple o files are put together. Now we know where the definition of a function ends up. That is, we know its address. The *call function_name* instructions are transformed into *call function_address* instructions that the processor knows how to execute. The lib files are precompiled bundles of o files, and they are taken as input by the (static) linker. They contain the machine code for functions such as printf, memset, etc.
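The difference between a name and an address can be seen from inside C++ itself. This is only a rough illustration, not a view of what the linker does internally, and the function names (add_one, call_by_name, call_by_address) are hypothetical.

```cpp
// Hypothetical function used only for illustration.
int add_one(int x) { return x + 1; }

// A pointer to function holds the address that the name add_one refers to.
using IntFn = int (*)(int);

int call_by_name(int x) {
    // In the source we write a name; the toolchain turns it into an address.
    return add_one(x);
}

int call_by_address(int x) {
    IntFn address = &add_one;  // take the address explicitly
    return address(x);         // call through the address; the same code runs
}
```

Both functions end up executing the same machine code. The name add_one exists for the benefit of the compiler and linker, while the running program only ever jumps to an address.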
Some names are not transformed into addresses during static linking. These are the names that refer to functions whose definitions live in a dll file. (Unlike lib files, dll files are not plain bundles of o files; a dll is itself the output of a linking step.) These leftover names are converted into proper addresses when the program is loaded or while it runs, in a process called dynamic linking. This process involves finding the proper dll file and locating the function with the given name.
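As a rough mental model, dynamic linking is a lookup by name that happens at runtime. The sketch below simulates that with an ordinary map. It is not how a real loader is implemented, and every name in it (Library, make_math_library, resolve, call_dynamic) is hypothetical.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a dll: a table from exported names to code.
using Library = std::map<std::string, std::function<int(int)>>;

Library make_math_library() {
    Library lib;
    lib["twice"] = [](int x) { return 2 * x; };
    lib["negate"] = [](int x) { return -x; };
    return lib;
}

// Look a function up by name at runtime, roughly the way a dynamic linker
// resolves a leftover name. Throws if the library does not export the name.
std::function<int(int)> resolve(const Library& lib, const std::string& name) {
    auto it = lib.find(name);
    if (it == lib.end()) {
        throw std::runtime_error("unresolved symbol: " + name);
    }
    return it->second;
}

// Resolve the name first, then call whatever it resolved to.
int call_dynamic(const Library& lib, const std::string& name, int arg) {
    return resolve(lib, name)(arg);
}
```

In a real program the operating system does this for you. You can also do the lookup by hand, with LoadLibrary and GetProcAddress on Windows or dlopen and dlsym on POSIX systems.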
In Java the story is a little different. First, there is no preprocessing. Second, the result of the compilation is not machine code but bytecode, and lives in class files (not o files). Bytecode is similar to machine code but at a higher level of abstraction. In particular, in bytecode you can say *call function_name*. This means that there is no static linking phase and that the lookup of the function by name is always done at runtime. The bytecode runs not on the real machine but on a virtual machine. C# is similar to Java, the main difference being that the bytecode (called Common Intermediate Language in the case of C#) is slightly different.
I don't think you really need any books for this. As I understand your question, you simply want to know what each type of file is for, and how they relate to the compilation process. If you want to know everything in detail, or if you're writing your own C++ compiler, you'll obviously need to hit the books.
But here's the high-level version:
First, let's ignore linkers. Not every language uses a dedicated linker, and in fact the C and C++ language standards do not require one either. The linker is an implementation detail that is typically used to make all the pieces fit together, but it's technically not required to exist at all.
Also, this is very C/C++ specific. The compilation process is different for every language, and in particular, C/C++ use a messy, obsolete and inefficient mechanism that most modern languages avoid.
First, you write some code. This code is saved in a number of files (typically with the extension .c, .cc or .cpp), and a number of headers (.h, .hh or .hpp). These extensions are not required though. They are just a common convention, but technically, you could name your files anything.
For the sake of example, let's assume we have the following files:
foo.h:
void foo();
foo.cpp:
#include "foo.h"
#include "bar.h"
void foo() {
    bar();
}
bar.h:
void bar();
bar.cpp:
#include "bar.h"
void bar() {
}
The compiler takes one .cpp file, and processes that. Let's say we compile foo.cpp first. The first thing it does is preprocessing: Expanding all macros, processing #include directives by copy/pasting the contents of the included file into the location it is #include'd from. When this is done, you have a translation unit, or a compilation unit, and it's going to look like this:
void foo(); //#include "foo.h"
void bar(); //#include "bar.h"
void foo() {
    bar();
}
Basically, all that happened in our simple example is that the headers got copy/pasted in.
Now the compiler compiles this to machine code, as much as it can. Of course, given that it can only see this one code file, it is going to run into calls to functions whose definitions it can't see.
How should it implement the call to bar() in our case? It can't, because it can't see what bar does. All it can see (because it included bar.h) is that the function bar exists, and that it takes no arguments and returns void. So the compiler basically generates a little "fill in later" label, essentially saying "jump to the address of this function, as soon as we find out what address that is".
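A single-file sketch shows that a declaration is all the compiler needs in order to compile a call. As an assumption made purely so the effect is observable, foo and bar return int here instead of void as in the example above.

```cpp
// Declaration only: tells the compiler that bar exists, takes no arguments,
// and returns int. (Assumption: the original bar returns void; int is used
// here so the result can be checked.)
int bar();

// foo can be compiled with nothing more than the declaration above. At this
// point the compiler has no idea what bar does or where it will live.
int foo() {
    return bar() + 1;
}

// The definition shows up later. In a multi-file project it would sit in
// bar.cpp and would only meet foo when the object files are put together.
int bar() {
    return 41;
}
```

If you delete the definition of bar at the bottom, the compiler still accepts foo without complaint. The error only appears later, when something has to supply the missing address.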
Now we've compiled foo.cpp. The output of this process is an object file, typically with the extension .o or .obj.
The compiler is now called on bar.cpp as well, and much the same things happen. Headers are included, and then the code is compiled to machine code, although this time, we shouldn't run into any problems with missing definitions.
So we're now left with foo.o and bar.o, containing the compiled code for each of the two compilation units.
Now we're in a funny no-man's land: the C++ language standard tells us what the program should do but has nothing more to say about how to get there, and we don't actually have a program yet. To fix this, we invoke the linker.
We feed it all our object files, and it reads through them and essentially fills in the blanks. When reading foo.o, it will notice that there is a call to bar() where the address of bar() was unknown. But the linker has access to bar.o as well, so it is able to look up the definition of bar(), determine its address, and paste that address into the call site inside the foo() function. It basically links together these standalone object files. When it has resolved all these references, it throws all the code together into one binary file (with the .exe extension on Windows), which is your program. The actual code is generated by the compiler, and the linker then jumps in and links together the definitions from one file with the references to them in other files.
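The "fill in the blanks" step can be modelled in a few lines. This is a toy sketch under heavy assumptions, not how any real linker is written. The ObjectFile layout, the function names, and the code sizes used in the note below are all invented for illustration.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, heavily simplified model of an object file.
struct ObjectFile {
    std::string name;                                // e.g. "foo.o"
    std::size_t code_size;                           // bytes of machine code
    std::map<std::string, std::size_t> definitions;  // symbol -> offset inside this file
    std::vector<std::string> references;             // symbols called, possibly defined elsewhere
};

// Lay the object files out one after another and compute each symbol's
// final address in the combined binary.
std::map<std::string, std::size_t> assign_addresses(const std::vector<ObjectFile>& objects) {
    std::map<std::string, std::size_t> table;
    std::size_t base = 0;
    for (const ObjectFile& obj : objects) {
        for (const auto& def : obj.definitions) {
            if (!table.emplace(def.first, base + def.second).second) {
                throw std::runtime_error("duplicate symbol: " + def.first);
            }
        }
        base += obj.code_size;
    }
    return table;
}

// Check that every "fill in later" reference now has an address. A missing
// one is reported the way a linker reports an undefined reference.
std::map<std::string, std::size_t> link_objects(const std::vector<ObjectFile>& objects) {
    const std::map<std::string, std::size_t> table = assign_addresses(objects);
    for (const ObjectFile& obj : objects) {
        for (const std::string& symbol : obj.references) {
            if (table.find(symbol) == table.end()) {
                throw std::runtime_error("undefined reference to " + symbol + " in " + obj.name);
            }
        }
    }
    return table;
}
```

Suppose foo.o is modelled as 32 bytes that define foo and reference bar, and bar.o as 16 bytes that define bar. Linking both places foo at address 0 and bar at address 32. Linking foo.o on its own throws, which corresponds to the familiar "undefined reference" error.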