I'm currently working my way through Andrew Appel's Modern Compiler Implementation in Java, and I'm right around the point where I build the low-level intermediate representation.

Initially, I had decided to target the JVM and ignore all of the low-level machine stuff, but in the interest of learning things that I don't know much about I've had a change of heart. This changes my IR, because targeting the JVM allows me to (more or less) wave my hands at making a method call or constructing an object.

The Appel book doesn't go into detail about any specific machine architecture, so I'd like to know where I can find out everything I need to know to go farther.

The things that I'm currently aware that I need to know are:

  • Which instruction set to use. I have two laptops I could develop on; both have Core 2 Duo processors. My current understanding is that x86 processors mostly use the same instruction set, but they are not all exactly the same.

  • Whether the operating system affects the code generation step of compilation, or whether it is completely dependent on the processor. For example, I know something is different about generating code to run on a 32-bit vs. a 64-bit platform.

  • How stack frames and such are organized. When to use registers vs. putting parameters on the stack, caller-save vs. callee-save, all of that. I'd have thought that this would be described along with the instruction set but so far I haven't seen this particular info anywhere. Maybe I'm misunderstanding something here?

Links to resources in lieu of answers are perfectly welcome.

+3  A: 

How stack frames and such are organized. When to use registers vs. putting parameters on the stack, caller-save vs. callee-save, all of that. I'd have thought that this would be described along with the instruction set but so far I haven't seen this particular info anywhere. Maybe I'm misunderstanding something here?

In general, there are no right answers to these questions. You can use whatever calling conventions you want...unless you want to interoperate with other people's code. For interoperability, compilers standardize on Application Binary Interfaces. My understanding is that the Itanium C++ ABI has become a popular standard in recent years. Try starting there.

Nathan Kitchen
Thanks, Nathan. I don't quite understand the purpose of the Itanium C++ ABI as it relates to my purposes (for example, what role does C++ play when developing a compiler for another language?); however, this link eventually led me to the various x86 calling conventions (cdecl, etc) which is what I was looking for.
danben
+1  A: 

I can't answer all of your questions, but:

  • The basic x86 instruction set is compatible across the x86 family of processors. You're not planning to implement any specific extensions, are you?
  • I don't think your OS or architecture matters much for code generation.
  • Default answer for anything compiler related is the Dragon book. Have you looked at it yet?
EightyEight
+3  A: 

Most of the x86 instruction set is common to all processors. It's a reasonably safe bet that both of your processors have the same instruction set, except possibly for SIMD instructions, which probably won't be very useful to you when implementing a simple compiler (they are normally used to make multimedia applications and the like go faster). The instruction set is listed in Intel's manuals: volumes 2A and 2B in particular have a full listing of instructions and their behaviour, and the other volumes are worth a look as well.

When generating user-space code, the choice of operating system matters when it comes to system calls. For instance, if you want a program to output something to the terminal on 64-bit Linux, you need to make a system call by:

  • loading the value 1 into register rax to indicate this is a write system call.
  • loading the value 1 into register rdi to indicate stdout should be used (1 is the file descriptor for stdout)
  • loading the start address of what you want to print into register rsi
  • loading the length of what you want to print into register rdx
  • executing the syscall instruction once the registers (and memory) have been set up.

The return value from write is stored in rax.
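
Since a compiler back end usually emits this sequence as text, the steps above can be sketched as a small code-generation helper. This is only an illustration: the function name emit_write_stdout, the label msg, and the NASM-style Intel syntax are my assumptions, not something a particular assembler or the Appel book prescribes.

```python
from __future__ import annotations


def emit_write_stdout(label: str, length: int) -> list[str]:
    """Return assembly lines for write(1, label, length) on x86-64 Linux.

    `label` is a hypothetical assembler label marking the bytes to print;
    the output uses NASM-style Intel syntax (an assumption of this sketch).
    """
    return [
        "mov rax, 1",               # system call number 1 = write
        "mov rdi, 1",               # file descriptor 1 = stdout
        f"lea rsi, [rel {label}]",  # start address of the bytes to print
        f"mov rdx, {length}",       # number of bytes to print
        "syscall",                  # write's return value is left in rax
    ]


# Print the fragment for a 13-byte message stored at the label `msg`.
print("\n".join(emit_write_stdout("msg", 13)))
```

This is a fragment rather than a complete program: the data at msg and the code that exits the process still have to be emitted around it.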

A different operating system might have a different system call number for write, might have a different way of passing in arguments (x86-64 Linux system calls always use rdi, rsi, rdx, r10, r8, and r9 in that order for parameters, with the system call number in rax), and might have different system calls altogether.
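
One way to keep these differences out of the rest of the code generator is to describe each target in a table and emit system calls from that description. The sketch below is a guess at how that might look, not an established design: SyscallABI, ABIS, and emit_syscall are hypothetical names, and the 32-bit Linux row (int 0x80, write numbered 4, arguments in ebx, ecx, edx, ...) is my addition for contrast rather than something stated above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SyscallABI:
    trap: str                  # instruction that enters the kernel
    number_reg: str            # register holding the system call number
    arg_regs: tuple[str, ...]  # argument registers, in order
    write_number: int          # system call number for write


# Hypothetical target table; only the first row comes from the text above.
ABIS = {
    "linux-x86_64": SyscallABI(
        "syscall", "rax", ("rdi", "rsi", "rdx", "r10", "r8", "r9"), 1),
    "linux-i386": SyscallABI(
        "int 0x80", "eax", ("ebx", "ecx", "edx", "esi", "edi", "ebp"), 4),
}


def emit_syscall(target: str, number: int, args: list[str]) -> list[str]:
    """Emit NASM-style lines that load the registers and trap into the kernel.

    Each entry of `args` is an operand already rendered as text
    (an immediate, a label, or a register name).
    """
    abi = ABIS[target]
    if len(args) > len(abi.arg_regs):
        raise ValueError("too many system call arguments for this target")
    lines = [f"mov {abi.number_reg}, {number}"]
    lines += [f"mov {reg}, {arg}" for reg, arg in zip(abi.arg_regs, args)]
    lines.append(abi.trap)
    return lines


# The same logical call, write(1, msg, 13), rendered for two targets.
for name, abi in ABIS.items():
    print(name, emit_syscall(name, abi.write_number, ["1", "msg", "13"]))
```

The rest of the back end then only needs to know the target name, and adding another operating system means adding a row rather than touching the emitter.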

The convention for ordinary function calls on Linux is similar: the order of registers is rdi, rsi, rdx, rcx, r8, and r9 (so all the same, except using rcx instead of r10), with further arguments on the stack and a return value in rax. According to this page, registers rbp, rbx, and r12 up to r15 should be preserved across function calls. You are, of course, free to make up your own convention (unless making a system call), but that makes it harder to call, or be called from, code generated or written by others.
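
For the IR-to-assembly step, this convention boils down to a mapping from argument position to location, plus a list of registers a function must restore before returning. Here is a rough sketch under some loud assumptions: it covers integer and pointer arguments only, it assumes the callee starts with the usual push rbp / mov rbp, rsp prologue (so the first stack argument sits at [rbp+16]), and the names arg_locations and registers_to_preserve are made up for illustration.

```python
from __future__ import annotations

# Register order for ordinary function calls, as listed above.
INT_ARG_REGS = ("rdi", "rsi", "rdx", "rcx", "r8", "r9")
# Registers a called function must hand back unchanged.
CALLEE_SAVED = ("rbx", "rbp", "r12", "r13", "r14", "r15")


def arg_locations(n_args: int) -> list[str]:
    """Where the callee finds each of its first n_args integer arguments."""
    locations = []
    for i in range(n_args):
        if i < len(INT_ARG_REGS):
            locations.append(INT_ARG_REGS[i])
        else:
            # Assumed frame layout: [rbp] holds the saved rbp, [rbp+8] the
            # return address, so stack arguments start at [rbp+16].
            offset = 16 + 8 * (i - len(INT_ARG_REGS))
            locations.append(f"[rbp+{offset}]")
    return locations


def registers_to_preserve(used: set[str]) -> list[str]:
    """Callee-saved registers a function body clobbers and must save/restore."""
    return [reg for reg in CALLEE_SAVED if reg in used]


print(arg_locations(8))
print(registers_to_preserve({"rax", "rbx", "r12"}))
```

A register allocator can use the second helper to decide which push/pop pairs to add to a function's prologue and epilogue; anything not in that list is caller-save, so the call site has to protect it if its value is still needed afterwards.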

Michael Williamson
Thanks, Michael - this answer was also very helpful. I wish I could accept it as well; my own fault for combining too many questions. +1, though.
danben
Actually, reading this for the second time I think that this answers all of my questions most thoroughly.
danben