With gcc, for example, I think the option to use is -save-temps.
Roughly, the first step is a pass over the file to pull in all the includes and create essentially a single file to be parsed. A lot of tools these days use a parser generated from a set of rules (bison, yacc, flex, etc.), the goal being to parse the text and turn your program into a sort of very wide assembly language, for lack of a better term.
a = a + 1;
could turn into
Load variable named a, size of blah, type unsigned foo
load immediate 1, size blah, unsigned
add
store result a
Then optimizations can occur. The compiler's intermediate language may have an increment operation, for example, and determine that an increment is better than the load of a 1 and an add. Once those optimizations are complete, the intermediate code goes through a backend for the target instruction set. This is typically output as assembly, which is fed into an assembler to produce an object file; target-specific optimizations can occur here as well. Then the object files are fed into the linker which, well, links them together. A function in one object file may call a function named bob that is not in that object file; the object file has no address or offset to reach bob, so it leaves a hole there for the address to be inserted. The linker's job is to connect all of these: decide where in the binary the function bob will live (assign it an address), then go find all the places that call bob and insert the instruction or address needed to allow bob to be called, so that the end result is an executable binary.
llvm, which is already a competitor to gcc, provides good visibility into this process: you can have the C code compiled to an intermediate representation. Start with our bob function
unsigned int bob ( unsigned int a )
{
return(a+1);
}
compile to bitcode:
clang -c -o bob.bc -emit-llvm bob.c
disassemble the bitcode to human-readable form:
llvm-dis bob.bc
Which results in bob.ll
define i32 @bob(i32 %a) nounwind {
entry:
%a.addr = alloca i32, align 4
store i32 %a, i32* %a.addr, align 4
%tmp = load i32* %a.addr, align 4
%add = add i32 %tmp, 1
ret i32 %add
}
Unoptimized code likes to store and fetch values from memory often, and when a value is passed into a function, to store and fetch it from the stack often.
In addition to easily letting you see behind the curtain, llvm is nice because you can optimize at any level: combine objects and optimize at the whole-program level, where gcc is going to limit you to file or function level only. So we can optimize this bitcode.
opt -std-compile-opts bob.bc -o bob_opt.bc
llvm-dis bob_opt.bc
And those extra stores and loads are gone and the meat of the function remains.
define i32 @bob(i32 %a) nounwind readnone {
entry:
%add = add i32 %a, 1
ret i32 %add
}
Then llc is used to turn that into assembler for the desired target:
llc -march=arm bob.bc
cat bob.s
...
bob: @ @bob
@ BB#0: @ %entry
str r0, [sp, #-4]!
add r0, r0, #1
add sp, sp, #4
bx lr
...
llc -march=arm bob_opt.bc
cat bob_opt.s
...
bob: @ @bob
@ BB#0: @ %entry
add r0, r0, #1
bx lr
...
Yes, there are many, many books out there, and many, many compilers, etc. In addition to llvm, Fabrice Bellard (yes, the qemu person) has a super-simple, barely-a-compiler that produces an intermediate file you can examine: http://bellard.org/fbcc/ It is buried such that it is hardly known, but fun to look at if you are just getting into the guts of compilers. There is also his better-known tcc: http://bellard.org/tcc/ This one specifically does not have a backend that goes through an assembler; opcodes are generated directly, both for speed and for real-time (re)compilation.