I'm in the middle of rewriting my assembler. While at it, I'm curious about implementing a disassembler as well. I want to make it simple and compact, and there are concepts I can exploit while doing so.
It is possible to determine the rest of an x86 instruction's encoding from the opcode (the prefix bytes matter a bit too). I know many people have written tables for doing this.
I'm not interested in mnemonics but in the instruction encoding, because that is the actual hard problem here. For each opcode number I need to know:
- does this instruction contain modrm?
- how many immediate fields does this instruction have?
- what encoding does an immediate use?
- is the immediate field an instruction pointer-relative address?
- what kind of registers does the modrm use for operand and register fields?
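To make the list above concrete, here is a minimal sketch of the record I imagine one table entry decoding into. Every name in it (`OpcodeInfo`, `TABLE`, the field names and the type strings) is hypothetical and only there for illustration; the two entries are filled in by hand.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OpcodeInfo:
    """Encoding facts for one opcode byte (hypothetical layout)."""
    has_modrm: bool                    # does a ModRM byte follow the opcode?
    immediates: Tuple[str, ...] = ()   # one type string per immediate field
    ip_relative: bool = False          # is the immediate a branch displacement?
    reg_type: Optional[str] = None     # operand type selected by ModRM.reg
    rm_type: Optional[str] = None      # operand type selected by ModRM.rm


# Two hand-written entries:
# 0x01 is ADD r/m, r (ModRM, no immediate); 0xEB is JMP rel8.
TABLE = {
    0x01: OpcodeInfo(has_modrm=True, reg_type="operand-size", rm_type="operand-size"),
    0xEB: OpcodeInfo(has_modrm=False, immediates=("byte",), ip_relative=True),
}
```

Each of the five questions maps to one field, so a length decoder would only need to look at `has_modrm` and `immediates`.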
sandpile.org has pretty much what I'd need, but it's in a format that isn't easy to parse.
Before I start writing and validating those tables myself, I decided to ask this question. Do you know of tables like this existing somewhere, in a form that doesn't require too much effort to parse?
b byte
w word
v word or dword (or qword), depends on operand size attribute (0x66)
z word or dword (stays dword even with 64-bit operand size), depends on operand size attribute
J instruction pointer-relative address (next character describes type)
G instruction group, has modrm-field (next character describes operand type)
R has modrm-field (next two characters describe register and operand type)
M modrm, but operand field must point to memory
O direct offset (next character describes type)
F FPU
T separate table
_ defined, but no arguments
x 0 1 2 3 4 5 6 7 8 9 A B C D E F
0 Rbb Rvv Rbb Rvv b z Rbb Rvv Rbb Rvv b z T
1 Rbb Rvv Rbb Rvv b z Rbb Rvv Rbb Rvv b z
2 Rbb Rvv Rbb Rvv b z Rbb Rvv Rbb Rvv b z
3 Rbb Rvv Rbb Rvv b z Rbb Rvv Rbb Rvv b z
4 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
5 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
6 _ _ Mvv z Rvvz b Rvvb
7 Jb Jb Jb Jb Jb Jb Jb Jb Jb Jb Jb Jb Jb Jb Jb Jb
8 Gbb Gvz Gbb Gvb Rbb Rvv Rbb Rvv Rbb Rvv Rbb Rvv Mvv
9 _ _ _ _ _ _ _ _ _ _ _ _
A Ob Ov Ob Ov _ _ _ _ b z _ _ _ _ _ _
B b b b b b b b b v v v v v v v v
C Gbb Gvb w _ _ b _ _
D Gb Gv Gb Gv F F F F F F F F
E Jz Jz Jb
F _ _ Gb Gv _ _ _ _ _ _ Gb Gv
Here is the table I have for the first opcode byte. Its format is such that it can be parsed straight out of a text file that contains it. I left out some CISC and segmentation-related instructions.
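As a rough sketch of what parsing the table could look like, the Python below decodes one cell code and one table row. It rests on a few assumptions of mine that the legend does not spell out: any characters left over after the operand types name an immediate (as in `Rvvz` or `Gvb`), `M` cells use the same two-character layout as `R` cells, and a row in the text file carries all 16 cells, with omitted opcodes marked by some placeholder. The function names are made up for the example.

```python
def decode_cell(code: str) -> dict:
    """Turn one cell code such as 'Rvvz' into a dict of encoding facts."""
    info = {"modrm": False, "reg": None, "rm": None, "imm": None,
            "relative": False, "offset": None,
            "memory_only": False, "special": None}
    head, rest = code[0], code[1:]
    if head == "_":
        pass                                  # defined, no arguments
    elif head in "bwvz":
        info["imm"] = head                    # plain immediate
    elif head == "J":
        info["imm"], info["relative"] = rest, True
    elif head == "O":
        info["offset"] = rest                 # direct offset
    elif head == "G":
        info["modrm"] = True
        info["rm"] = rest[0]
        info["imm"] = rest[1:] or None        # assumed: trailing char is an immediate
    elif head in "RM":
        info["modrm"] = True
        info["memory_only"] = head == "M"
        info["reg"], info["rm"] = rest[0], rest[1]
        info["imm"] = rest[2:] or None        # assumed: trailing char is an immediate
    elif head in "FT":
        info["special"] = head                # decoding continues in another table
    else:
        raise ValueError(f"unknown cell code: {code!r}")
    return info


def imm_size(code: str, operand_size: int) -> int:
    """Size in bytes of an immediate of type b/w/v/z for a 16/32/64-bit operand size."""
    if code == "b":
        return 1
    if code == "w":
        return 2
    if code == "z":
        return 2 if operand_size == 16 else 4
    if code == "v":
        return operand_size // 8
    raise ValueError(f"unknown immediate type: {code!r}")


def parse_row(line: str) -> dict:
    """Parse a row like '7 Jb Jb ...' into {opcode: decoded cell}."""
    label, *cells = line.split()
    if len(cells) != 16:
        raise ValueError("expected 16 cells; omitted opcodes need a placeholder")
    high = int(label, 16) << 4
    return {high | low: decode_cell(cell) for low, cell in enumerate(cells)}


row7 = parse_row("7 " + " ".join(["Jb"] * 16))
```

With this, `row7[0x74]` describes a short conditional jump (byte-sized, relative), and `imm_size("z", 64)` gives 4 because `z` never grows beyond a dword.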
For two-byte instructions, chances are I'll need four such tables. For three-byte instructions I'll need two more tables. FPU instructions require 8 tables, which are fortunately very simple. After that I'd have a pretty large chunk of the x86 instructions covered, though I can get by just fine with only one or two tables.
Further, a few instruction groups might require some small arrays to recognise the instruction type.
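One case where such a small array seems necessary is the group behind opcodes 0xF6 and 0xF7: TEST (/0) carries an immediate while NOT, NEG, MUL, IMUL, DIV and IDIV (/2 to /7) do not, so a single cell per opcode cannot express it. The sketch below indexes by the reg field of the modrm byte. The names are hypothetical, and treating /1 as an alias of /0 is an assumption on my part.

```python
# Immediate type per ModRM.reg value (/0 .. /7); None means no immediate.
# /1 is assumed here to behave like /0 (TEST).
GROUP3_IMM = {
    0xF6: ("b", "b", None, None, None, None, None, None),
    0xF7: ("z", "z", None, None, None, None, None, None),
}


def group3_immediate(opcode: int, modrm: int):
    """Return the immediate type for an F6/F7 instruction, or None."""
    reg = (modrm >> 3) & 7        # bits 5..3 of ModRM select the group member
    return GROUP3_IMM[opcode][reg]
```

For example, `group3_immediate(0xF7, 0xC0)` picks /0 (TEST) and yields `"z"`, while a modrm of 0xD8 picks /3 (NEG) and yields `None`.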