I'm interested in writing an x86 assembler. I'm wondering what is a good way to map x86 assembly mnemonic instructions (using an Intel-like syntax) into the corresponding binary machine code instructions.

+2  A: 

For x86, it's complicated as hell. A little less complicated since 32-bit processors took over, but yeah. Still a pain.

You may want to take a look at nasm ( http://www.nasm.us ). It's an open source x86 assembler (16-, 32-, and 64-bit). See how they do it. Or, use it instead. :)

cHao
A: 

It's just a straight-up one-to-one mapping; the Intel documentation describes all of the instructions and their encodings. You'll need to build a giant lookup table or something equivalent to do the matching and code generation.

Carl Norum
something tells me you never looked at x86 encoding. a single mnemonic can correspond to multiple opcodes, each opcode can have many prefixes, size overrides... and I'm sure I'm missing some stuff.
Bahbar
I write x86 assembly code every day. It has to be one-to-one, otherwise how do you know which opcode gets emitted for which instruction you wrote? Just because there are prefixes, special modifiers, memory access or register versions, etc. doesn't change the fact that for each instruction you write in the assembly file you have to know what machine instruction gets emitted....
Carl Norum
I take that back; it could be many-to-one, if you want to have multiple mnemonics generate the same machine instruction. It can't be one-to-many, though, unless you built some kind of context sensitivity into the assembler. The first case is unnecessary work, and the second case seems like a bad idea in general, so I'll let my answer stand.
Carl Norum
Look at this answer for examples of mappings that are one-to-many: http://stackoverflow.com/questions/2546715/how-to-analysis-how-many-bytes-each-instruction-takes-in-assembly/2761248#2761248
Nathan Fellman
That answer talks about the context sensitivity I mentioned above. Point well taken, assembler directives are often used to handle such mappings. That said, the information about what instruction will be emitted is still available to the programmer.
Carl Norum
well, if you want more examples that don't involve prefixes: mov eax, [ebx] and mov [eax], ebx don't use the same opcode (8B and 89 respectively, I believe). The x86 encoding is really not a 1-1 mapping with _mnemonics_. Yes, the assembler has all the data it needs to generate the machine code from the source. Not just from mnemonics, is all I was saying.
Bahbar
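To make the point about operand forms concrete, here is a minimal sketch of a lookup table keyed on the mnemonic plus the kinds of its operands, rather than on the mnemonic alone. It assumes 32-bit code, covers only the two MOV forms discussed above with a plain [reg] address, and the names `REGS`, `OPCODES`, `operand`, and `encode_mov` are made up for illustration.

```python
# 32-bit register numbers as used in the ModRM byte.
REGS = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
        "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

# (mnemonic, destination kind, source kind) -> opcode byte.
# Only two forms of MOV are listed; a real table has many more rows.
OPCODES = {
    ("mov", "mem", "reg"): 0x89,  # MOV r/m32, r32
    ("mov", "reg", "mem"): 0x8B,  # MOV r32, r/m32
}


def operand(text):
    """Classify an operand as ('reg', n) or ('mem', n) for a plain [reg] address."""
    text = text.strip().lower()
    if text.startswith("[") and text.endswith("]"):
        base = text[1:-1].strip()
        if base in ("esp", "ebp"):
            # These need a SIB byte or a displacement, which this sketch skips.
            raise ValueError("[esp] and [ebp] are not handled here")
        return ("mem", REGS[base])
    return ("reg", REGS[text])


def encode_mov(dst, src):
    """Encode 'mov dst, src' for the two forms in OPCODES (hypothetical helper)."""
    dkind, dnum = operand(dst)
    skind, snum = operand(src)
    opcode = OPCODES[("mov", dkind, skind)]  # KeyError for unsupported forms
    # ModRM byte: mod=00 (register-indirect), reg=register operand, rm=base register.
    if dkind == "mem":
        reg, rm = snum, dnum
    else:
        reg, rm = dnum, snum
    return bytes([opcode, (0b00 << 6) | (reg << 3) | rm])


print(encode_mov("eax", "[ebx]").hex())  # 8b03
print(encode_mov("[eax]", "ebx").hex())  # 8918
```

The same mnemonic lands on two different opcode bytes purely because of which operand is the memory reference, which is why the table key has to include the operand kinds.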
+3  A: 

Do you want to understand the physical mapping of mnemonics to machine code? If so, volumes 2A and 2B of the Intel 64 and IA-32 Architectures Software Developer's Manual describe the binary format of x86 machine code.

The x86 instruction set page on Wikipedia has a compact listing of all the instructions categorized by when they were introduced, which might help you prioritize what to implement first.

However, if you are asking how to parse an assembly source file to the point where your program can start writing out machine code, then you basically need to understand how to write a compiler. The tools lex and yacc are good places to start, but if you don't know how to build a compiler you'll also need a book. I think the Dragon book is the best one out there, but there are any number of other books you could use; SO has plenty of recommendations.

Andrew O'Reilly
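For a sense of what the parsing step involves at its simplest, here is a rough sketch that splits one line of Intel-style source into a label, a mnemonic, and a list of operands. It is a hand-rolled regular expression rather than lex/yacc, the name `parse_line` is hypothetical, and it assumes `;` starts a comment and that operands never contain commas (so string literals in data directives would break it).

```python
import re

# Optional "label:" at the start of the line, then everything else.
LINE_RE = re.compile(r"^\s*(?:(?P<label>[A-Za-z_.$][\w.$]*):)?\s*(?P<rest>.*)$")


def parse_line(line):
    """Return (label, mnemonic, operands) for one source line (illustrative only)."""
    line = line.split(";", 1)[0]          # drop a trailing comment
    m = LINE_RE.match(line)
    label = m.group("label")
    rest = m.group("rest").strip()
    if not rest:
        return label, None, []            # blank line, or a label on its own
    parts = rest.split(None, 1)           # mnemonic, then the operand text
    operand_text = parts[1] if len(parts) > 1 else ""
    operands = [op.strip() for op in operand_text.split(",")] if operand_text else []
    return label, parts[0].lower(), operands


print(parse_line("loop: mov eax, [ebx] ; load"))  # ('loop', 'mov', ['eax', '[ebx]'])
print(parse_line("ret"))                          # (None, 'ret', [])
```

A grammar-based parser earns its keep once operands grow into full expressions such as `[ebx+esi*4+8]`; for flat lines like these a split is often enough.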
You may not need a full-fledged compiler for this. You need a simple two-pass assembler with some sort of lookup table. You may not always generate the best code that way, but you'll get something that works.
Nathan Fellman
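The two-pass idea can be sketched in a few lines: the first pass only measures instruction sizes to learn where each label lands, and the second pass emits bytes once every label offset is known. The sketch below assumes 32-bit code and a made-up three-instruction subset (`nop`, `ret`, `jmp label`); the `SIZES` table and `assemble` function are illustrative names, not taken from any real assembler.

```python
# Fixed encoded sizes for a tiny, hypothetical instruction subset.
# jmp always uses the near form (E9 rel32), even when a 2-byte short jump would fit.
SIZES = {"nop": 1, "ret": 1, "jmp": 5}


def assemble(lines):
    """Assemble a list of source lines into bytes using two passes."""
    # Pass 1: record the byte offset of every label and of every instruction.
    labels, program, offset = {}, [], 0
    for line in lines:
        line = line.split(";", 1)[0].strip()
        if not line:
            continue
        if line.endswith(":"):
            labels[line[:-1]] = offset
            continue
        parts = line.split()
        program.append((offset, parts))
        offset += SIZES[parts[0]]

    # Pass 2: emit machine code; forward references can now be resolved.
    out = bytearray()
    for offset, parts in program:
        op = parts[0]
        if op == "nop":
            out.append(0x90)
        elif op == "ret":
            out.append(0xC3)
        elif op == "jmp":
            # rel32 is measured from the end of the 5-byte jmp instruction.
            rel = labels[parts[1]] - (offset + 5)
            out.append(0xE9)
            out += (rel & 0xFFFFFFFF).to_bytes(4, "little")
    return bytes(out)


code = assemble(["start:", "nop", "jmp end", "nop", "end:", "ret", "jmp start"])
print(code.hex())  # 90e90100000090c3e9f3ffffff
```

Always emitting the 5-byte jump keeps sizes fixed so one sizing pass is enough, at the cost of larger code than an assembler that shrinks jumps to the short form where the target is in range.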