I have written an interpreter for my (educational) programming language, and now I'd like to go one step further and write a compiler for it. I know that this is pretty hard work.

What I already know is:

  • I need to translate my input language to assembler

A lot, isn't it? Now what I don't know is:

  • What assembler do I need to create Win32 PE executables like, for example, Visual Studio does?
  • What about file headers?
  • How do I combine the assembler with my compiler?

I'd prefer not to use MASM, but it seems like I'll have to.

A: 

You don't strictly need to translate your code to assembly; you can get away with translating it to any language that can compile to a native executable.

Let's look at an extremely simple example. Say I had some worthless imaginary language (hereafter called Adder) whose input file consists of any number of lines, each containing a space-delimited list of integers. The output is the sum of the integers on each line, printed one sum per line.

So for an input file

1
1 2 3
200 50 6

the output would be

1
6
256

You can write an interpreter for Adder in a single line of Ruby:

puts($_.split.map(&:to_i).inject(0, :+)) while gets

What if I wanted to translate an input program to a standalone Ruby script? Simple:

while line = gets
  num = line.split.map(&:to_i).inject(0, :+)
  puts "puts(#{num})"
end

Output:

$ ruby adder2rb.rb nums.txt 
puts(1)
puts(6)
puts(256)
$ ruby adder2rb.rb nums.txt  | ruby -
1
6
256

Okay, now what if we want to translate this to something that actually compiles to a native executable -- say, C? We hardly have to change anything:

puts '#include <stdio.h>'
puts 'int main() {'

while line = gets
  num = line.split.map(&:to_i).inject(0, :+)
  puts "  printf(\"%ld\\n\", #{num}L);"
end

puts '  return 0;'
puts '}'

Session output:

$ ruby adder2c.rb nums.txt
#include <stdio.h>
int main() {
  printf("%ld\n", 1L);
  printf("%ld\n", 6L);
  printf("%ld\n", 256L);
  return 0;
}
$ ruby adder2c.rb nums.txt | tcc -
$ ./a.out
1
6
256

(Note here that tcc is Tiny C Compiler, which may be very useful to your project if you want end users to be able to generate executables from your generated C files.)
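
If you'd rather have your compiler call the C compiler itself instead of piping by hand, here is a minimal sketch of such a driver. The method names adder_to_c and build_executable are made up for illustration, and it assumes the chosen compiler is on your PATH and accepts "-o OUTPUT INPUT" the way tcc and gcc do.

```ruby
require "tempfile"

# Translate Adder source lines into a C program, using the same
# template as adder2c.rb above.
def adder_to_c(lines)
  body = lines.map do |line|
    sum = line.split.map(&:to_i).inject(0, :+)
    "  printf(\"%ld\\n\", #{sum}L);"
  end
  (["#include <stdio.h>", "int main() {"] + body + ["  return 0;", "}"]).join("\n") + "\n"
end

# Write the generated C to a temporary file and hand it to an external
# C compiler. Assumption: `compiler` is on PATH and understands
# `-o OUTPUT INPUT` (true for tcc and gcc; other compilers may differ).
def build_executable(lines, output, compiler: "tcc")
  Tempfile.create(["adder", ".c"]) do |file|
    file.write(adder_to_c(lines))
    file.flush
    # Kernel#system returns true on success, false or nil otherwise.
    system(compiler, "-o", output, file.path)
  end
end
```

With that in place, something like build_executable(File.readlines("nums.txt"), "adder") should leave an executable next to your script, provided the compiler is installed. The same shape works for an assembler: generate text, write it to a file, run the external tool.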

Want to translate to another high level language? How about Haskell?

$ cat adder2hs.rb
puts 'main = do'

while line = gets
  num = line.split.map(&:to_i).inject(0, :+)
  puts "  print #{num}"
end
$ ruby adder2hs.rb nums.txt
main = do
  print 1
  print 6
  print 256
$ ruby adder2hs.rb nums.txt | runghc
1
6
256

Of course, the code translator for any language with more than one construct will be significantly more complex than the above examples; however, the basic idea remains the same: you follow general templates for your output language.
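
To make the template idea a bit more concrete, here is a rough sketch with two constructs instead of one. The second construct (a line starting with "*" means "multiply instead of add") is something I just made up for this example, as are the constant and method names. Unlike the scripts above, it emits the arithmetic as C expressions rather than computing it at translation time.

```ruby
# Templates for the generated C. %s is filled in by Kernel#format;
# %% produces a literal % in the output.
PROGRAM_TEMPLATE   = "#include <stdio.h>\nint main() {\n%s\n  return 0;\n}\n"
STATEMENT_TEMPLATE = '  printf("%%ld\n", %s);'

# Build the C expression for one input line.
# Hypothetical second construct: a leading "*" multiplies the numbers
# on the line; any other line adds them, as in plain Adder.
def expression_for(line)
  if line.start_with?("*")
    operator, identity, text = " * ", "1L", line[1..-1]
  else
    operator, identity, text = " + ", "0L", line
  end
  terms = text.split.map { |token| "#{token.to_i}L" }
  # An empty line falls back to the identity of the operation.
  terms.empty? ? identity : terms.join(operator)
end

# Translate a whole program by filling the statement template once per
# line and then dropping the statements into the program template.
def translate(lines)
  statements = lines.map { |line| format(STATEMENT_TEMPLATE, expression_for(line)) }
  format(PROGRAM_TEMPLATE, statements.join("\n"))
end
```

A real language would have a parser producing a tree and one template per node type, but the translation step keeps this same shape.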

Now if you decide that you still really want to generate assembly instead of high-level code, you aren't restricted to a single implementation there either. Somewhat easier than straight assembly is translating to a virtual machine's bytecode. MSIL would give you .NET executables, or you could use LLVM's code generation facilities. If Java is more your thing you can emit JVM bytecode. One slightly less common choice would be Parrot.

Of those VMs, AFAIK only LLVM will generate actual native executables, but maybe that isn't your top concern right now.
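
For the LLVM route, the simplest way in is to emit textual LLVM IR the same way the scripts above emit C. The following is only a sketch: the method name adder_to_llvm_ir is made up, and the IR uses the opaque pointer syntax (ptr), which I'm assuming your LLVM version accepts (roughly LLVM 15 or newer; older versions want i8* instead).

```ruby
# Translate Adder source lines into textual LLVM IR that calls printf
# once per input line. Assumes opaque-pointer syntax (`ptr`).
def adder_to_llvm_ir(lines)
  calls = lines.each_with_index.map do |line, i|
    sum = line.split.map(&:to_i).inject(0, :+)
    # Each call result needs a unique name, hence %r0, %r1, ...
    "  %r#{i} = call i32 (ptr, ...) @printf(ptr @fmt, i64 #{sum})"
  end

  header = [
    # "%lld\n" plus the terminating NUL is 6 bytes.
    '@fmt = private unnamed_addr constant [6 x i8] c"%lld\0A\00"',
    "",
    "declare i32 @printf(ptr, ...)",
    "",
    "define i32 @main() {",
    "entry:"
  ]
  footer = ["  ret i32 0", "}"]

  (header + calls + footer).join("\n") + "\n"
end
```

If you save the result as, say, adder.ll, a recent clang should be able to turn it into a native executable with clang adder.ll -o adder. I haven't tried this against every LLVM version, so treat the exact IR syntax as something to check against the LLVM Language Reference for the version you have installed.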

Mark Rushakoff