I would like to learn the x86 Instruction Set Architecture. I don't mean learning an assembly language for x86. I want to understand the machine code, baby.

The reason is that I would like to write an assembler for x86. Then I want to write a compiler that compiles to that assembly.

I know that there are the Intel manuals and AMD manuals that cover the x86 instruction set. But those are very large and dense.

I'm wondering if there is a more approachable (possibly tutorial-style) way to learn the x86 instruction set architecture.

+1  A: 

If you just want to understand the numbers and some of the complexities, such as ModR/M bytes and the other oddities behind them, you may want to try implementing a simple 8086 emulator (just the CPU). I found it to be a fun and interesting experience.
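
To give a feel for what the ModR/M byte involves, here is a minimal sketch of the decoding step an 8086 emulator would need. The function names (`decode_modrm`, `describe_rm`) are made up for illustration; the field layout and the 16-bit addressing table follow the standard 8086 encoding.

```python
# Sketch: split an 8086 ModR/M byte into its fields and describe the operand.
# Function names are hypothetical; the bit layout is mod(2) reg(3) rm(3).

REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]
RM16 = ["BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX"]

def decode_modrm(byte):
    """Return the (mod, reg, rm) bit fields of a ModR/M byte."""
    mod = (byte >> 6) & 0b11
    reg = (byte >> 3) & 0b111
    rm = byte & 0b111
    return mod, reg, rm

def describe_rm(mod, rm):
    """Describe the r/m operand for 16-bit operand and address sizes."""
    if mod == 3:
        return REG16[rm]               # register operand, no memory access
    if mod == 0 and rm == 6:
        return "[disp16]"              # special case: direct address
    if mod == 0:
        return "[" + RM16[rm] + "]"
    if mod == 1:
        return "[" + RM16[rm] + "+disp8]"
    return "[" + RM16[rm] + "+disp16]"

# Example: the bytes 01 D8 encode ADD AX, BX (opcode 01 is ADD r/m16, r16).
mod, reg, rm = decode_modrm(0xD8)
print("ADD", describe_rm(mod, rm) + ",", REG16[reg])
```

An emulator would read the displacement bytes that follow the ModR/M byte whenever `mod` is 1 or 2 (or the direct-address case applies) before executing the instruction.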

http://www.ousob.com/ng/iapx86/ is a really good reference I used when writing an emulator. It gives a very nice list of opcodes, along with the CPU version in which each one first appeared and the hex opcode for each variation.

Earlz
Implementing a simple 8086 emulator sounds like a good idea, thank you. Do you know of any references or resources about how to do that?
mudge
Hmm.. I see.. Well, there is a tutorial somewhere that basically covers making your own compiler (straight to machine code) and some of the x86's complexities. I don't know what it was called or anything, but it was on the DarkBasic forums over 5 years ago.. not sure it's even still there.
Earlz
And I had a really good "lower level" opcode reference, but it's on my other computer, so I'll edit my answer later to include it.
Earlz
That would be awesome. I appreciate it.
mudge
@mudge, there you go.
Earlz
@Earlz: I will be studying this. Thank you!
mudge
+3  A: 

At some point you will have to cope with a bit of complexity. The x86 instruction set is large.

But you can make things substantially simpler by reading the documentation for an older CPU. Intel and AMD seem to add dozens of new instructions to each submodel. Try to read the Intel manual for the 80386, which is substantially smaller and yet covers much of what you will use.

I know a good (old) book, but it is in French. It is called "Programmation du 80386" by J.-M. and M. Trio. I am not sure it is still in print (I bought mine nearly 20 years ago).

Thomas Pornin
Great idea. Thank you.
mudge
A: 

Unless you are writing an operating system, you should never try to get any lower than assembly level on any architecture. Keep in mind that x86 assembly is an abstraction layer that translates into machine code that the processor can interpret. So if you want to understand what a processor does with binary instructions, I would take a look at any entry-level computer science digital logic book; many of them have a primer on how processors work with these instruction sets. If you want to write a compiler, you should start with something a bit higher level than binary instructions: read the red Dragon Book on compilers, then work on writing your own assembler.

Ioxp
Thanks. Yes, besides learning the instruction set so that I can compile down to it, I also want to write an operating system.
mudge
I don't mean to sound rude, but compiler design, assembler design, and OS design will require a lot of reading through endless amounts of code, and it's not really something that you can just jump into and hope to be good at. If you want a primer, I recommend auditing a class on the topic at a college somewhere; you will be light years ahead of trying to work it out on your own.
Ioxp
True enough. I know my question sounds kind of like a sudden whim. But I've actually been thinking about these things and tinkering with them here and there for years, and it seems to me that I'll probably keep working and tinkering with them for many years to come. But at some point I'd really like to get much more intensively involved and spend much more time on it.
mudge
x86 assembly directly produces the machine code instructions fed to the CPU. What happens inside the CPU is opaque, and of interest only to chip designers. In fact, last I looked the CPU deconstructed the x86 instructions and essentially ran them on a much simpler internal CPU. A digital logic book will be looking at a much simpler system than the x86 internals.
David Thornley
@Dave: good insight. You are right, and I was a bit unclear on the topic; thanks for the clarification. @Mudge: I'll look around and see what I can find in the course materials I may have left over from my compilers and OS classes in college, and pass them on if I have anything. Also, with regard to the OS, you should check out the Brown University simulator (http://www.versiontracker.com/dyn/moreinfo/macosx/15091). It's a nice starting place for much of what you want to do.
Ioxp
+1  A: 

I think you are not being realistic. You said:

I know that there are the Intel manuals and AMD manuals that cover the x86 instruction set. But those are very large and dense.

...

I'd like to learn all of that. Perhaps I should start with what is simplest and easiest to learn.

Did you ask yourself why they are large and dense? The answer is simple! If we look only at Intel x86 products:

  • There are: 8086, 8088, 80186, 80188 and 80286 16-bit CPUs.
  • There are: 80386 and 80486 32-bit CPUs (the 80486 with a built-in floating-point coprocessor).
  • There are: Pentium and Pentium MMX
  • There are: Pentium Pro, Pentium II and Pentium III
  • There are: Pentium 4, Pentium M, Celeron, Prescott
  • There are: Intel Core 2, Intel Core i7
  • There is: Intel Atom
  • There is: Sandy Bridge

  • There are 16, 32 and 64 bit architectures

  • There are several different floating-point math units.
  • There are several Streaming SIMD Extensions.
  • There are several protected modes of the CPU.

There are...

There are 32 years of R&D on x86 architectures. And I didn't mention AMD, VIA and so on!

No, there is no faster way!

GJ
The first microprocessor was Intel's 4004. They then implemented the 8008, an 8-bit version. The popular 8080 was built on that, and the 8086/8088 was assembly-language compatible with the 8080 (although it wasn't efficient; the first MS BASIC for the 8088 was slower than the version for the 8080). Since the 8008 appeared in 1972, we're talking about 38 years, not 32.
David Thornley
You're right. The complete x86 instruction set architecture seems too big and complex. Let me be more specific. I'd like to learn a practical subset (as simple as possible, but still practical) of the x86 instruction set for use on modern computers. I'd use this subset in the implementation of a compiler and operating system.
mudge
@David Thornley: yes, you're right. But the first x86 CPU was the Intel 8086.
GJ
@GJ: Sure. I'm just pointing out that the 8086 was not by any means a de novo design, but one that was already encumbered with historical choices.
David Thornley
+2  A: 

I'd say jump to the deep water and start from there.

Start by writing a simple (C/C++) application. Then use the epic debugger called OllyDbg ( http://www.ollydbg.de/ ). Debug your application and see how the compiler implemented your code. Check loops. Check function calls. Check API calls. Check memory manipulation.

By doing that you'll get a real idea of how to do things.

I've been debugging applications this way, and that's how I learned assembly. You say you want to UNDERSTAND the machine code, and in my opinion there's no better way.

You may also want to look into something called a "crackme" (google it). These put you in a challenge that tests your skills. Once you're in control, you'll see that everything you want to know is just a matter of digging through the instruction set manual. Get the point? Challenge yourself with specific targets.

Good luck. It's not easy yet very possible.

Poni
+2  A: 

Well, I don't agree with you. The complexity of x86 is misunderstood and thus exaggerated. I'm not saying that it isn't complex. It surely is, but that's the case only if you want to write a full-fledged compiler or assembler. If you just want to learn assembly, it isn't that complex.

Let's break down the x86-64 architecture to prove my point.


Registers:

x86-64 specifies only a few registers. How many exactly? Let's enumerate them:

  • 16 General purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP + R8, R9, R10, R11, R12, R13, R14, R15)
  • 6 Segment registers (CS, DS, SS, ES, FS, GS)
  • 64-bit RFlags & 64-bit RIP
  • 8 80-bit Floating point (x87) registers (FPR0-FPR7) aliased to 64-bit MMX registers (MM0-MM7)
  • 16 128-bit extended media registers (XMM0-XMM7 + XMM8-XMM15)
  • some special/miscellaneous registers such as control registers (CR0 through 4), debug registers (DR0 through 3, plus 6 and 7), test registers (TR4 through 7), descriptor registers (GDTR, LDTR, IDTR), and a task register (TR), which we hardly need to care about.


Addressing Modes:

How do you reference a memory location?

Source: http://en.wikipedia.org/wiki/X86#Addressing_modes

Addressing modes for 32-bit address size on 32-bit or 64-bit x86 processors can be summarized by this formula:

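address = [segment:] base + (index * scale) + displacement

where base and index are 32-bit general-purpose registers (ESP cannot be the index), scale is 1, 2, 4 or 8, and every component is optional.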

Addressing modes for 64-bit code on 64-bit x86 processors can be summarized by these formulas:

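address = [FS: or GS:] base + (index * scale) + displacement

where base and index are 64-bit general-purpose registers (RSP cannot be the index), scale is 1, 2, 4 or 8, and the displacement is at most 32 bits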

and

RIP + [displacement]
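
As a rough illustration of the base + index * scale + displacement scheme, here is a small sketch that computes an effective address the way the formulas describe. The helper names (`effective_address`, `rip_relative`) and the example values are invented for illustration; segment bases are ignored.

```python
# Sketch: compute an effective address from the x86 addressing-mode parts.
# Names and example values are hypothetical; segment bases are left out.

def effective_address(base=0, index=0, scale=1, displacement=0, bits=64):
    """base + index*scale + displacement, wrapped to the address size."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    mask = (1 << bits) - 1
    return (base + index * scale + displacement) & mask

def rip_relative(next_rip, displacement):
    """RIP-relative form: RIP here is the address of the NEXT instruction."""
    return (next_rip + displacement) & ((1 << 64) - 1)

# Something like [rbx + rcx*8 - 16] with rbx = 0x1000 and rcx = 3:
print(hex(effective_address(base=0x1000, index=3, scale=8, displacement=-16)))

# 32-bit address size wraps around at 4 GiB:
print(hex(effective_address(base=0xFFFFFFF0, displacement=0x20, bits=32)))
```

An assembler has to go the other way: given the base, index, scale and displacement that the programmer wrote, it picks the ModR/M and SIB bytes that encode them.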


Operation Modes:

These are the modes in which it can operate:

  1. Real mode
  2. Protected mode
    • Virtual 8086 mode
  3. Long mode

Instruction Set:

You hear people saying it's a large instruction set. Well, there are around 500-600 instructions. But some of them are the same instruction with very small variations, like CMPS/CMPSB/CMPSW/CMPSD/CMPSQ. If you group them like this, the number comes down to about 400 instructions.
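
To make the grouping concrete, here is a small sketch showing that the CMPS variants are really one opcode plus size information. The encodings listed are the 64-bit-mode ones; the helper `base_opcode` is a made-up name, and its prefix stripping is only good enough for this little table, not a general decoder.

```python
# Sketch: the CMPS family shares one opcode; only the operand size differs.
# Encodings are for 64-bit mode. base_opcode is a hypothetical helper that
# handles just the prefixes appearing in this table.

CMPS_VARIANTS = {
    "cmpsb": bytes([0xA6]),        # byte compare
    "cmpsw": bytes([0x66, 0xA7]),  # operand-size prefix selects 16 bits
    "cmpsd": bytes([0xA7]),        # default 32-bit operand size
    "cmpsq": bytes([0x48, 0xA7]),  # REX.W prefix selects 64 bits
}

def base_opcode(encoding):
    """Drop a leading 0x66 or REX prefix, then clear the width bit."""
    data = encoding
    while data[0] == 0x66 or 0x40 <= data[0] <= 0x4F:
        data = data[1:]
    return data[0] & 0xFE          # bit 0 chooses byte vs. full-size operand

for name, encoding in CMPS_VARIANTS.items():
    print(name, encoding.hex(), hex(base_opcode(encoding)))
```

All four mnemonics collapse to the same base opcode, which is why an assembler can treat them as one instruction with a size parameter.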

Do you feel it's very large? Then I have a few questions. How many functions does the C standard library have? How many functions does the POSIX library have? What about .NET & Java? How many classes & methods do they have? Do we have to know all of the functions/methods/classes? What approach do we take to learning these libraries?

Just learn a few from each. Go through all of them roughly. Get a feel for what exists, and use the reference when you need it.

We can logically divide these instructions into following categories:

  1. General-Purpose Instructions
    • Basic Data Manipulation (moving & copying)
    • Control Transfer (Jumps, Calls, Interrupts)
    • Arithmetic & Logic Instructions (add,sub,and,xor etc..)
    • String & Bit Oriented Instructions
    • System Calls
  2. System Instructions
  3. x87 Floating-Point Instructions
  4. 64-Bit Media (MMX) Instructions
  5. 128-Bit Media (SSE) Instructions

That's it! That's all you need to know. Now tell me frankly: is it that complex?

Just get any good book on assembly language covering the x86 architecture. I would personally suggest "Assembly Language Programming in GNU/Linux for IA32 Architectures" by Rajat Moona, because it's short & to the point and doesn't waste much of your time. But it doesn't cover x86-64.

Once you are familiar with IA32, for x86-64 read http://csapp.cs.cmu.edu/public/1e/public/docs/asm64-handout.pdf

claws
+1  A: 

Old versions of the NASM manual had a nice, concise reference, though being old the CPUs they refer to are only so recent. Here's a random copy I found. Lists opcodes (arranged so the patterns are easy to see), and describes the addressing mode encodings:

http://www.posix.nl/linuxassembly/nasmdochtml/nasmdoca.html

I wrote a runtime machine code generator (targeting 486 or better) using basically just this information, so there should be enough there to get you started...
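
For a sense of how little is needed to get started, here is a minimal sketch of a machine code generator for three 32-bit instructions. This is not my actual generator; the function names are invented for illustration, and only the opcodes themselves (B8+r for MOV r32, imm32; 01 /r for ADD r/m32, r32; C3 for RET) come from the reference.

```python
# Sketch: emit raw machine code bytes for a few 32-bit x86 instructions.
# Function names are hypothetical; register numbers follow the standard order.
import struct

REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def mov_reg_imm32(reg, value):
    """MOV r32, imm32: opcode B8 + register number, then a little-endian imm32."""
    return bytes([0xB8 + REG32[reg]]) + struct.pack("<I", value & 0xFFFFFFFF)

def add_reg_reg(dst, src):
    """ADD r/m32, r32: opcode 01, ModR/M with mod=11, reg=src, rm=dst."""
    modrm = 0xC0 | (REG32[src] << 3) | REG32[dst]
    return bytes([0x01, modrm])

def ret():
    """RET (near return): single byte C3."""
    return b"\xC3"

# mov eax, 40 / mov ecx, 2 / add eax, ecx / ret
code = (mov_reg_imm32("eax", 40) + mov_reg_imm32("ecx", 2)
        + add_reg_reg("eax", "ecx") + ret())
print(code.hex(" "))
```

A real generator would then copy these bytes into executable memory and call them; that part is OS-specific and left out here.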

brone
+1 for the link.
claws