views:

218

answers:

4

Background Information: Ultimately, I would like to write an emulator of a real machine such as the original Nintendo or Gameboy. However, I decided that I need to start somewhere much, much simpler. My computer science advisor/professor offered me the specifications for a very simple imaginary processor that he created to emulate first. There is one register (the accumulator) and 16 opcodes. Each instruction consists of 16 bits, the first 4 of which contain the opcode, the rest of which is the operand. The instructions are given as strings in binary format, e.g., "0101 0101 0000 1111".

My Question: In C++, what is the best way to parse the instructions for processing? Please keep my ultimate goal in mind. Here are some points I've considered:

  1. I can't just process and execute the instructions as I read them because the code is self-modifying: an instruction can change a later instruction. The only way I can see to get around this would be to store all changes and for each instruction to check whether a change needs to be applied. This could lead to a massive amounts of comparisons with the execution of each instruction, which isn't good. And so, I think I have to recompile the instructions in another format.

  2. Although I could parse the opcode as a string and process it, there are instances where the instruction as a whole has to be taken as a number. The increment opcode, for example, could modify even the opcode section of an instruction.

  3. If I were to convert the instructions to integers, I'm not sure then how I could parse just the opcode or operand section of the int. Even if I were to recompile each instruction into three parts, the whole instruction as an int, the opcode as an int, and the operand as an int, that still wouldn't solve the problem, as I might have to increment an entire instruction and later parse the affected opcode or operand. Moreover, would I have to write a function to perform this conversion, or is there some library for C++ that has a function convert a string in "binary format" to an integer (like Integer.parseInt(str1, 2) in Java)?

  4. Also, I would like to be able to perform operations such as shifting bits. I'm not sure how that can be achieved, but that might affect how I implement this recompilation.

Thank you for any help or advice you can offer!

+6  A: 

Parse the original code into an array of integers. This array is your computer's memory.

Use bitwise operations to extract the various fields. For instance, this:

unsigned int x = 0xfeed;
unsigned int opcode = (x >> 12) & 0xf;

will extract the topmost four bits (0xf, here) from a 16-bit value stored in an unsigned int. You can then use e.g. switch() to inspect the opcode and take the proper action:

enum { ADD = 0 };

unsigned int execute(int *memory, unsigned int pc)
{
  const unsigned int opcode = (memory[pc++] >> 12) & 0xf;

  switch(opcode)
  {
  case OP_ADD:
    /* Do whatever the ADD instruction's definition mandates. */
    return pc;
  default:
    fprintf(stderr, "** Non-implemented opcode %x found in location %x\n", opcode, pc - 1);
  }
  return pc;
}

Modifying memory is just a case of writing into your array of integers, perhaps also using some bitwise math if needed.

unwind
I was hoping that someone would mention a concept like this. I've never used it before though, so I'll have to do more research. Thank you!
Brandon
Ahh, memories of university projects!
sdg
+1. This is the basic approach you should take. The key point here, Brandon, relative to your question, is that in order to approach this normally, you need to get to "machine code" which will be the array of bytes in an array that represent your virtual computer's address space. Then if the instructions edit memory (code), you don't do anything special, you just follow the instructions and they should do the right thing inside your big virtual memory array. IOW, you need both the assembler (the tool that translates text strings into instruction bytes) and the emulator, the thing that executes
quixoto
Brandon
Brandon
@Brandon: The masking (bitwise and) is not strictly necessary, but it adds a certain amount of clarity. It also protects you from things that could happen if, for instance, there were bits set above the 16 lowest in the memory integer. You could squeeze these things out by e.g. using unsigned shorts and so on, but it's easier to be safe.
unwind
+2  A: 

I think the best approach is to read the instructions, convert them to unsigned integers, and store them into memory, then execute them from memory.

  1. Once you've parsed the instructions and stored them to memory, self-modification is much easier than storing a list of changes for each instruction. You can just change the memory at that location (assuming you don't ever need to know what the old instruction was).

  2. Since you're converting the instructions to integers, this problem is moot.

  3. To parse the opcode and operand sections, you'll need to use bit shifting and masking. For example, to get the op code, you mask off the upper 4 bits and shift down by 12 bits (instruction >> 12). You can use a mask to get the operand too.

  4. You mean your machine has instructions that shift bits? That shouldn't affect how you store the operands. When you get to executing one of those instructions, you can just use the C++ bit-shifting operators << and >>.

Nick Meyer
Regarding 1: In that case, I'd still have to recompile the set of instructions before executing them as opposed to executing them as they're read, correct? That's not a problem, I was just wondering whether there might be a good way to accomplish the non-recompilation method. // Regarding 4: Right, that's what I meant. I thought it'd affect how I'd store the instruction pieces, as I would have to consider each piece separately. But this bit math seems like it could achieve it. // What you've mentioned makes sense conceptually. I'll try it when I get home today. Thanks!
Brandon
I don't know what you mean by "recompile" here. A (simple) CPU doesn't "recompile" anything: it's got the bits, and it does what they say. Can you clarify what you mean by that?
Ken
My understanding is that emulators often function by recompiling the instructions to another format that can be more easily processed by the emulator's machine. I'm using the term loosely here, but I think it's the same idea. I was just wondering whether there is an efficient way to execute the instructions without having to store the instructions somewhere else in memory in another form (as unsigned ints, for example).
Brandon
Brandon
@Brandon, you're right, the 0xF000 isn't necessary. What it does is mask out the lowest 12 bits, setting them all to 0 -- but then the shift will toss them out anyway. My bad.
Nick Meyer
+1  A: 

Just in case it helps, here's the last CPU emulator I wrote in C++. Actually, it's the only emulator I've written in C++.

The spec's language is slightly idiosyncratic but it's a perfectly respectable, simple VM description, possibly quite similar to your prof's VM:

http://www.boundvariable.org/um-spec.txt

Here's my (somewhat over-engineered) code, which should give you some ideas. For instance it shows how to implement mathematical operators, in the Giant Switch Statement in um.cpp:

http://www.eschatonic.org/misc/um.zip

You can maybe find other implementations for comparison with a web search, since plenty of people entered the contest (I wasn't one of them: I did it much later). Although not many in C++ I'd guess.

If I were you, I'd only store the instructions as strings to start with, if that's the way that your virtual machine specification defines operations on them. Then convert them to integers as needed, every time you want to execute them. It'll be slow, but so what? Yours isn't a real VM that you're going to be using to run time-critical programs, and a dog-slow interpreter still illustrates the important points you need to know at this stage.

It's possible though that the VM actually defines everything in terms of integers, and the strings are just there to describe the program when it's loaded into the machine. In that case, convert the program to integers at the start. If the VM stores programs and data together, with the same operations acting on both, then this is the way to go.

The way to choose between them is to look at the opcode which is used to modify the program. Is the new instruction supplied to it as an integer, or as a string? Whichever it is, the simplest thing to start with is probably to store the program in that format. You can always change later once it's working.

In the case of the UM described above, the machine is defined in terms of "platters" with space for 32 bits. Clearly these can be represented in C++ as 32-bit integers, so that's what my implementation does.

Steve Jessop
Before I spoke with my professor, I searched for something simple to emulate and found that contest. It was still a bit too complicated for me to start with, but that might be my next step. I will look over your code for ideas. Thank you!
Brandon
A: 

I created an emulator for a custom cryptographic processor. I exploited the polymorphism of C++ by creating a tree of base classes:

struct Instruction  // Contains common methods & data to all instructions.
{
    virtual void execute(void) = 0;
    virtual size_t get_instruction_size(void) const = 0;
    virtual unsigned int get_opcode(void) const = 0;
    virtual const std::string& get_instruction_name(void) = 0;
};

class Math_Instruction
:  public Instruction
{
  // Operations common to all math instructions;
};

class Branch_Instruction
:  public Instruction
{
  // Operations common to all branch instructions;
};

class Add_Instruction
:  public Math_Instruction
{
};

I also had a couple of factories. At least two would be useful:

  1. Factory to create instruction from text.
  2. Factory to create instruction from opcode

The instruction classes should have methods to load their data from an input source (e.g. std::istream) or text (std::string). The corollary methods of output should also be supported (such as instruction name and opcode).

I had the application create objects, from an input file, and place them into a vector of Instruction. The executor method would run the 'execute()` method of each instruction in the array. This action trickled down to the instruction leaf object which performed the detailed execution.

There are other global objects that may need emulation as well. In my case some included the data bus, registers, ALU and memory locations.

Please spend more time designing and thinking about the project before you code it. I found it quite a challenge, especially implementing a single-step capable debugger and GUI.

Good Luck!

Thomas Matthews
BTW, at my alma mater, one of the professors had his students write functions to emulate hardware components for the *Microprocessor Design* class. We had some good discussions about processor emulation. :-)
Thomas Matthews