For self-education purposes, I want to implement a simple virtual machine for a dynamic language, preferably in C. Something like the Lua VM, Parrot, or the Python VM, but simpler. Are there any good resources or tutorials on achieving this, apart from reading the code and design documentation of the existing VMs?

Thanks in advance for your answers and ideas

Edit: why the close vote? I don't understand; is this not a programming question? Please comment if there is a specific problem with my question.


One possibility would be to read Virtual Machine Design and Implementation in C/C++. I should give the disclaimer that while I glanced at it in the bookstore, I haven't really read through it. It survived my usual test of opening it to a few random pages and seeing if they contained any obvious errors, but that's an extremely unscientific test at best. At the same time, the test is good enough that (just for one example) every book by Herbert Schildt I've looked at has failed it in under five minutes...

Jerry Coffin
Sadly, it has a low average rating on Amazon and many very bad reviews.
+1  A: 

Someone here has asked the same question; maybe it will help you.

+5  A: 

Well, it's not about implementing a VM in C, but since it was the last tab I had open before I saw this question, I feel like I need to point out an article about implementing a QBASIC bytecode compiler and virtual machine in JavaScript using the <canvas> tag for display. It includes all of the source code needed to get enough of QBASIC implemented to run the "nibbles" game, and is the first in a series of articles on the compiler and bytecode interpreter; this one describes the VM, and the author is promising future articles describing the compiler as well.

By the way, I didn't vote to close your question, but the close vote you got was as a duplicate of a question from last year on how to learn about implementing a virtual machine. I think this question (about a tutorial or something relatively simple) is different enough from that one that it should remain open, but you might want to refer to that one for some more advice.

Brian Campbell
+5  A: 

I assume you want a virtual machine rather than a mere interpreter. I think they are two points on a continuum. An interpreter works on something close to the original representation of the program. A VM works on more primitive (and self-contained) instructions. This means you need a compilation stage to translate the one to the other. I don't know if you want to work on that first or if you even have an input syntax in mind yet.

For a dynamic language, you want somewhere that stores data (as key/value pairs) and some operations that act on it. The VM maintains the store. The program running on it is a sequence of instructions (including control flow). You need to define the set of instructions. I'd suggest a simple set to start with, like:

  • basic arithmetic operations, including arithmetic comparisons, that read from and write to the store
  • basic control flow
  • built-in print
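The store described above could be sketched in C roughly as follows. This is only a minimal illustration: the names `Store`, `Entry`, `store_set` and `store_get`, the fixed capacity, the integer-only values and the linear search are all assumptions made for brevity, not part of the answer. A real VM would more likely use a hash table and a tagged value type.

```c
#include <assert.h>
#include <string.h>

#define STORE_CAPACITY 64   /* arbitrary limit for this sketch */
#define NAME_MAX_LEN   32   /* arbitrary limit, including the NUL byte */

/* One key/value pair: a variable name and its integer value. */
typedef struct {
    char name[NAME_MAX_LEN];
    int  value;
} Entry;

typedef struct {
    Entry entries[STORE_CAPACITY];
    int   count;
} Store;

void store_init(Store *s) { s->count = 0; }

/* Return the entry for name, or NULL if no such variable exists. */
Entry *store_find(Store *s, const char *name) {
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->entries[i].name, name) == 0)
            return &s->entries[i];
    return NULL;
}

/* Create the variable if needed, then assign it. */
void store_set(Store *s, const char *name, int value) {
    Entry *e = store_find(s, name);
    if (e == NULL) {
        assert(s->count < STORE_CAPACITY);
        assert(strlen(name) < NAME_MAX_LEN);
        e = &s->entries[s->count++];
        strcpy(e->name, name);
    }
    e->value = value;
}

/* Read a variable; unknown names read as 0 in this sketch. */
int store_get(Store *s, const char *name) {
    Entry *e = store_find(s, name);
    return e ? e->value : 0;
}
```

Because variables are looked up by string at runtime, a name computed by the program can be passed to `store_get` and `store_set` in exactly the same way as a constant one.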

You may want to use a stack-based approach to arithmetic, as many VMs do. There isn't much that is dynamic in the above yet. To get there we want two things: the ability to compute the names of variables at runtime (this just means string operations), and some treatment of code as data. This might be as simple as allowing function references.
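To make the stack-based approach to arithmetic concrete, here is a small sketch. The opcodes (`S_PUSH`, `S_ADD`, `S_MUL`, `S_LESS`, `S_END`) and the `eval` function are hypothetical and exist only for this illustration; operands are popped from a stack and results are pushed back, so instructions need no explicit source or destination arguments.

```c
#include <assert.h>

#define STACK_SIZE 32   /* arbitrary limit for this sketch */

typedef struct {
    int data[STACK_SIZE];
    int top;            /* number of values currently on the stack */
} Stack;

void push(Stack *s, int v) {
    assert(s->top < STACK_SIZE);
    s->data[s->top++] = v;
}

int pop(Stack *s) {
    assert(s->top > 0);
    return s->data[--s->top];
}

/* Hypothetical opcodes; S_PUSH is followed by one operand. */
enum { S_PUSH, S_ADD, S_MUL, S_LESS, S_END };

/* Run instructions until S_END and return the value left on top. */
int eval(const int *code) {
    Stack s;
    s.top = 0;
    for (int ip = 0; code[ip] != S_END; ip++) {
        int a, b;
        switch (code[ip]) {
        case S_PUSH:
            push(&s, code[++ip]);   /* operand follows the opcode */
            break;
        case S_ADD:
            b = pop(&s); a = pop(&s);
            push(&s, a + b);
            break;
        case S_MUL:
            b = pop(&s); a = pop(&s);
            push(&s, a * b);
            break;
        case S_LESS:                /* comparison yields 1 or 0 */
            b = pop(&s); a = pop(&s);
            push(&s, a < b);
            break;
        default:
            assert(!"unknown opcode");
            break;
        }
    }
    return pop(&s);
}
```

For example, the sequence `S_PUSH 2, S_PUSH 3, S_ADD, S_PUSH 4, S_MUL, S_END` computes (2 + 3) * 4.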

Input to the VM would ideally be in bytecode. If you haven't got a compiler yet this could be generated from a basic assembler (which could be part of the VM).

The VM itself consists of the loop:

1. Look at the bytecode instruction pointed to by the instruction pointer.
2. Execute the instruction:
   * If it's an arithmetic instruction, update the store accordingly.
   * If it's control flow, perform the test (if there is one) and set the instruction pointer.
   * If it's print, print a value from the store.
3. Advance the instruction pointer to the next instruction (unless a control flow instruction has already set it).
4. Repeat from 1.
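The loop above might look like this in C. This is a sketch under several assumptions: the opcodes (`OP_SET`, `OP_ADD`, `OP_JUMP_IF_LESS`, `OP_PRINT`, `OP_HALT`) are made up for the illustration, the store is simplified to an array of integer slots, and an explicit halt instruction is added so that the loop can terminate.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical one-byte opcodes, each followed by one-byte operands. */
enum { OP_HALT, OP_SET, OP_ADD, OP_JUMP_IF_LESS, OP_PRINT };

/* Run code until OP_HALT. store is an array of integer slots.
   Returns the number of instructions executed (not counting OP_HALT). */
int run(const unsigned char *code, int *store, FILE *out) {
    int ip = 0;
    int executed = 0;
    for (;;) {
        unsigned char op = code[ip];        /* step 1: fetch */
        int next = ip;
        switch (op) {                       /* step 2: execute */
        case OP_HALT:
            return executed;
        case OP_SET:                        /* SET slot, value */
            store[code[ip + 1]] = code[ip + 2];
            next = ip + 3;
            break;
        case OP_ADD:                        /* ADD dst, src: dst += src */
            store[code[ip + 1]] += store[code[ip + 2]];
            next = ip + 3;
            break;
        case OP_JUMP_IF_LESS:               /* JUMP_IF_LESS slot, limit, target */
            next = (store[code[ip + 1]] < code[ip + 2]) ? code[ip + 3]
                                                        : ip + 4;
            break;
        case OP_PRINT:                      /* PRINT slot */
            fprintf(out, "%d\n", store[code[ip + 1]]);
            next = ip + 2;
            break;
        default:
            assert(!"unknown opcode");
            return executed;
        }
        ip = next;                          /* step 3: advance or jump */
        executed++;                         /* step 4: repeat */
    }
}
```

Note that the control flow instruction chooses the next value of the instruction pointer itself, while every other instruction simply steps past its own operands.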

Dealing with computed variable names might be tricky: an instruction needs to specify which variables the computed names are in. This could be done by allowing instructions to refer to a pool of string constants provided in the input.

An example program (in assembly and bytecode):

offset  bytecode (hex)   source
 0      01 05 0E         //      LOAD 5, .x
 3      01 03 10         // .l1: LOAD 3, .y
 6      02 0E 10 0E      //      ADD .x, .y, .x
10      03 0E            //      PRINT .x
12      04 03            //      GOTO .l1
14      78 00            //      .x: "x"
16      79 00            //      .y: "y"

The instruction codes implied are:

"LOAD x, k" (01 x k) Load single byte x as an integer into variable named by string constant at offset k.
"ADD k1, k2, k3" (02 k1 k2 k3) Add two variables named by string constants k1 and k2 and put the sum in variable named by string constant k3.
"PRINT k" (03 k) Print variable named by string constant k.
"GOTO a" (04 a) Go to offset given by byte a.

You need variants for when variables are named by other variables, etc. (and the levels of indirection get tricky to reason about). The assembler looks at arguments such as those in "ADD .x, .y, .x" and generates the correct bytecode for adding from string constants (and not computed variables).
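As a rough sketch, the four instruction codes and the example program above could be interpreted in C like this. The opcode values and byte layout come from the listing; everything else is an assumption for illustration: the helper names (`Vars`, `var_slot`, `vm_run`), the linear variable lookup, the rule that unknown variables start at 0, and the `max_steps` limit, which is needed only because the example program loops forever.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { OP_LOAD = 0x01, OP_ADD = 0x02, OP_PRINT = 0x03, OP_GOTO = 0x04 };

#define MAX_VARS 16   /* arbitrary limit for this sketch */

typedef struct {
    const char *names[MAX_VARS];   /* point into the bytecode image */
    int         values[MAX_VARS];
    int         count;
} Vars;

/* Find the variable named by the NUL-terminated string at offset k,
   creating it with value 0 if it does not exist yet. */
int *var_slot(Vars *vars, const unsigned char *image, unsigned char k) {
    const char *name = (const char *)&image[k];
    for (int i = 0; i < vars->count; i++)
        if (strcmp(vars->names[i], name) == 0)
            return &vars->values[i];
    assert(vars->count < MAX_VARS);
    vars->names[vars->count] = name;
    vars->values[vars->count] = 0;
    return &vars->values[vars->count++];
}

/* Execute at most max_steps instructions, starting at offset 0. */
void vm_run(const unsigned char *image, Vars *vars, int max_steps) {
    int ip = 0;
    for (int step = 0; step < max_steps; step++) {
        switch (image[ip]) {
        case OP_LOAD:                       /* 01 x k */
            *var_slot(vars, image, image[ip + 2]) = image[ip + 1];
            ip += 3;
            break;
        case OP_ADD: {                      /* 02 k1 k2 k3 */
            int a = *var_slot(vars, image, image[ip + 1]);
            int b = *var_slot(vars, image, image[ip + 2]);
            *var_slot(vars, image, image[ip + 3]) = a + b;
            ip += 4;
            break;
        }
        case OP_PRINT:                      /* 03 k */
            printf("%d\n", *var_slot(vars, image, image[ip + 1]));
            ip += 2;
            break;
        case OP_GOTO:                       /* 04 a */
            ip = image[ip + 1];
            break;
        default:
            assert(!"unknown opcode");
            return;
        }
    }
}

/* The example program from the listing, byte for byte. */
const unsigned char example_program[] = {
    0x01, 0x05, 0x0E,         /*  0      LOAD 5, .x      */
    0x01, 0x03, 0x10,         /*  3 .l1: LOAD 3, .y      */
    0x02, 0x0E, 0x10, 0x0E,   /*  6      ADD .x, .y, .x  */
    0x03, 0x0E,               /* 10      PRINT .x        */
    0x04, 0x03,               /* 12      GOTO .l1        */
    0x78, 0x00,               /* 14 .x:  "x"             */
    0x79, 0x00                /* 16 .y:  "y"             */
};
```

Running `vm_run(example_program, &vars, 9)` on an empty `Vars` executes the initial LOAD and two passes through the loop, printing 8 and then 11.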

Nice. Any ideas for resources on where to go from here?
@zaharpopv: I'm not too sure about implementing the dynamic functionality of your language, but a simple VM design like the above is easy enough that once you've done it you will learn how suitable it is and can afford to change it to support more interesting features. Also, looking at the set of instructions for the Python interpreter might give you a few ideas on how to support dynamism.

Another resource to look at is the implementation of the Lua language. It is a register-based VM that has a good reputation for performance. The source code is in ANSI C89, and is generally very readable.

As with most high performance scripting languages, the end user sees a readable, high level dynamic language (with features like closures, tail calls, immutable strings, numbers and hash tables as the primary data types, functions as first class values, and more). Source text is compiled to the VM's bytecode for execution by a VM implementation whose outline is pretty much as described by Edmund's answer.

A great deal of effort has gone into keeping the implementation of the VM itself both portable and efficient. If even more performance is needed, a just in time compiler from VM byte code to native instructions exists for 32-bit x86, and is in beta release for 64-bit.