Consider this x86 assembly code:

section .data

foo:
    mov ebx, [boo]
    mov [goo], ebx
goo:
    mov eax, 2
    mov eax, 3
    ret
boo:
    mov eax, 4
    mov eax, 5
    ret

What exactly is going on here? When I dereference [boo] and mov it to [goo] what exactly am I moving there? Just one command? The ret as well?


Follow-up questions:

  1. Does dereferencing a label give me an address? Or the machine code for the first command in the label?
  2. If it's machine code - how can it possibly be more than one command? Aren't all commands essentially 32-bit (even if not all bits are used)?
  3. Bottom line - will eax have a value of 3 or 5 at the end?
+2  A: 

The first mov copies four bytes from the offset boo, relative to the DS segment register, into ebx. The second mov writes ebx to the offset goo, again relative to DS. If CS and DS refer to the same memory, the segment distinction can be ignored. Even then, you are likely to run into protection mechanisms that make code sections read-only.

RE followups:

  1. A label isn't like a reference, so you don't dereference it as such. The assembler substitutes a number representing the location in the resulting code. You can load either the address or the thing at the address; the [ and ] indicate dereferencing (I've fixed a confusing element in my first response to cover this). In other words, [goo] loads the thing at that address.
  2. A CISC instruction set like x86 has [very] variable-length instructions, some of which are not even a multiple of the word length. RISC instruction sets generally try to restrict this to make decoding instructions simpler.
  3. 3. You are only modifying the first 4 bytes of mov eax, 2. Due to the little-endian encoding, its immediate does get replaced with 4, but eax is then overwritten by the next instruction, mov eax, 3, which hasn't been modified at all. 5 is never in the picture as a candidate. (I thought you were thinking the code gets reordered, given the way you first asked the question[1], though you clearly know quite a bit more, as I should have guessed from your rep :P)

Note that all of this assumes that CS = DS and that DEP isn't stepping in.
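To make the byte-level effect concrete, here is a small Python model of the two mov instructions. It is only a sketch, assuming a flat memory where CS = DS and writes are permitted. The encodings used (b8 followed by a 4-byte little-endian immediate for mov eax, imm32, and c3 for ret) are the standard ones; the mem layout, the GOO/BOO offsets and the run helper are made up for illustration, and foo's own bytes are left out.

```python
import struct

def mov_eax(imm):
    # mov eax, imm32 is opcode 0xB8 followed by the immediate, little endian
    return b"\xb8" + struct.pack("<I", imm)

RET = b"\xc3"

# Lay out goo and boo back to back, as in the question (foo itself is omitted).
goo_code = mov_eax(2) + mov_eax(3) + RET
boo_code = mov_eax(4) + mov_eax(5) + RET
mem = bytearray(goo_code + boo_code)
GOO, BOO = 0, len(goo_code)  # hypothetical offsets within this toy memory

# mov ebx, [boo]  -> load 4 bytes starting at boo
ebx = struct.unpack_from("<I", mem, BOO)[0]
# mov [goo], ebx  -> store those 4 bytes starting at goo
struct.pack_into("<I", mem, GOO, ebx)

def run(mem, pc):
    """Toy interpreter that understands only mov eax, imm32 and ret."""
    eax = None
    while mem[pc] != 0xC3:
        assert mem[pc] == 0xB8, "unexpected opcode in this sketch"
        eax = struct.unpack_from("<I", mem, pc + 1)[0]
        pc += 5
    return eax

print(mem[GOO:GOO + 5].hex(" "))  # b8 04 00 00 00, i.e. mov eax, 4
print(run(mem, GOO))              # 3
```

The first instruction at goo turns into mov eax, 4, but the untouched mov eax, 3 still runs afterwards, so eax ends up as 3.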

Also, if you were using BX instead of EBX, the sort of things you were expecting would come into play (using xX instead of ExX accesses the low 2 bytes of the register [and xL accesses the lowest byte]).

[1] Remember that an assembler is purely a tool for writing opcodes. Things like labels all get boiled down to numbers, with very little magic or impressive transformation of the code; there are no closures or anything deep of that nature lurking in there. (This is slightly oversimplifying: code can be relocatable, and in many cases fixups get applied to usages of offsets by a combination of the linker and the loader.)

Ruben Bartelink
Right - I had RISC in mind... thanks for clearing up that point
Yuval A
+7  A: 

boo is the offset of the instruction mov eax, 4 inside section .data. mov ebx, [boo] means “fetch four bytes at the offset indicated by boo into ebx”. Likewise, mov [goo], ebx moves the content of ebx to the offset indicated by goo.

However, code is often read-only, so it wouldn't be surprising to see the code just crash.

Here is how the instructions at boo are encoded:

boo:
b8 04 00 00 00          mov    eax,0x4
b8 05 00 00 00          mov    eax,0x5
c3                      ret

So what you get in ebx is actually 4/5 of the mov eax, 4 instruction.
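As a rough illustration of the "4/5 of an instruction" point, this Python sketch splits the five bytes of mov eax, 4 (the first instruction at boo in the question's listing) into the four bytes a 32-bit load picks up and the one byte it leaves behind. The variable names are illustrative only.

```python
insn = bytes.fromhex("b8 04 00 00 00")  # mov eax, 4 as encoded at boo

loaded, left_behind = insn[:4], insn[4:]  # a 32-bit load takes 4 of the 5 bytes
ebx = int.from_bytes(loaded, "little")    # x86 stores integers little endian

print(loaded.hex(" "), "| not loaded:", left_behind.hex(" "))
print(f"ebx = 0x{ebx:08x}")
```

The opcode byte b8 ends up in the lowest byte of ebx, and the last byte of the immediate is never read.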

Bastien Léonard
It looks like this happens to work because the immediates aren't full 32-bit quantities, so the last byte will always be 0. This code will fail if you try something like mov eax, 0xC000000
Michael
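The failure case described in this comment can be checked with a byte model. This is a sketch that assumes goo still starts with mov eax, 2 and that boo is changed to start with mov eax, 0xC000000; the helper and variable names are hypothetical.

```python
import struct

def mov_eax(imm):
    # mov eax, imm32: opcode 0xB8 plus a 4-byte little-endian immediate
    return b"\xb8" + struct.pack("<I", imm)

goo = bytearray(mov_eax(2))   # b8 02 00 00 00
boo = mov_eax(0x0C000000)     # b8 00 00 00 0c

goo[0:4] = boo[0:4]           # only 4 of the 5 instruction bytes are copied

patched_imm = struct.unpack_from("<I", goo, 1)[0]
print(goo.hex(" "), hex(patched_imm))
```

The copy stops one byte short, so the 0c byte never arrives and the patched instruction is mov eax, 0 rather than mov eax, 0xC000000.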
"fetch the four bytes" was what I was looking for. thanks!
Yuval A
+1  A: 

Follow up answers:

  1. It gives you the machine code starting at the address. How much of it you get depends on the length of your load; in this case it is 4 bytes.

  2. It can be more than one command or only a fragment of a command. On this architecture (Intel x86), machine code commands are between 8 and 120 bits long (1 to 15 bytes).

  3. 3.
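To illustrate the variable instruction length, here is a short Python sketch that measures a few 32-bit-mode encodings. The byte strings are the standard encodings as far as I know; the absolute address 0x0 in the last entry is a placeholder for wherever boo ends up.

```python
# Encodings in 32-bit mode; the address 0x00000000 is only a placeholder.
encodings = {
    "ret": "c3",
    "mov eax, 3": "b8 03 00 00 00",
    "mov ebx, [0x0]": "8b 1d 00 00 00 00",
}

lengths = {asm: len(bytes.fromhex(hexstr)) for asm, hexstr in encodings.items()}

for asm, n in lengths.items():
    print(f"{asm:<16} {n} bytes = {n * 8} bits")
```

These three instructions are 1, 5 and 6 bytes long, so a fixed 4-byte load will rarely line up with exactly one instruction.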

drhirsch