Hi everyone,

So this morning I posted a confused question about assembly and I received some great genuine help, which I really appreciate.

And now I'm starting to get into assembly and am beginning to understand how it works.

Things I feel I understand alright include the stack, interrupts, binary/hex, and in general what most of the basic operations do (jmp, push, mov, etc).

Concepts that I'm struggling to understand and would like help with are below - it would be a huge help if you could address any of the following:

  1. What exactly is happening in the .data section? Are those variables we're declaring?
  2. If so, can we declare variables later in the code section? If not, why not? If so, how, and why do we use the data section then?
  3. What's a register? How does it compare to a variable? I mean I know it's a location that stores a small piece of information... but that sounds exactly like a variable to me.
  4. How do I make an array? I know this seems kind of random, but I'm curious as to how I'd go about doing something like this.
  5. Is there a list somewhere of common practices for what each register should be used for? I still don't get them completely, but have noticed some people saying, for example, that a certain register should be used to store 'return values' from procedures - is there a comprehensive or at least informative list of such practices?
  6. One of the reasons I'm learning assembly is to better understand what's going on behind my high level code. With that in mind - when I'm programming in c++, I'm often thinking about the stack and the heap. In assembly I know what the stack is - where's the 'heap'?

Some info: I'm using masm32 with WinAsm as an IDE, and I'm working on Windows 7. I have a lot of prior experience programming in higher level languages such as c++/java.


edit: Thanks for the help everyone, extremely informative as usual! Great stuff! One last thing though - I'm wondering what the difference is between the Stack Pointer, and the Base pointer, or ESP and EBP. Can someone help me out?

edit: I think I get it now... ESP always points to the top of the stack. However, you can point EBP at whatever you want. ESP is automatically handled but you can do whatever you want with EBP. For example:

push 6
push 5
push 4
mov EBP, ESP
push 3
push 2

In this scenario, EBP now points to the address holding 4, but ESP now points to the address holding 2.

In a real application 6, 5, and 4 could have been function arguments, whereas 3 and 2 could be local variables within that function.
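
The push sequence above can be modelled in C as a rough sketch. Everything named here (`Stack`, `stack_init`, `push_word`, `demo_frame`) is made up for illustration: the array stands in for stack memory, and the two indices stand in for ESP and EBP rather than being real addresses. The index starts at the end of the array to mimic the way the x86 stack grows toward lower addresses.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 16

/* Toy model: `mem` stands in for stack memory; `esp` and `ebp` are
   indices into it, not real machine addresses. */
typedef struct {
    int mem[STACK_WORDS];
    size_t esp;  /* index of the most recently pushed word */
    size_t ebp;  /* frame base, set explicitly by the "program" */
} Stack;

static void stack_init(Stack *s) {
    s->esp = STACK_WORDS;  /* empty stack: one past the last slot */
    s->ebp = STACK_WORDS;
}

/* Like x86 `push`: move esp down one slot, then store the value. */
static void push_word(Stack *s, int value) {
    assert(s->esp > 0);    /* guard against overflowing the toy stack */
    s->esp--;
    s->mem[s->esp] = value;
}

/* Replays the push sequence from the question. */
static void demo_frame(Stack *s) {
    stack_init(s);
    push_word(s, 6);
    push_word(s, 5);
    push_word(s, 4);
    s->ebp = s->esp;       /* mov EBP, ESP */
    push_word(s, 3);
    push_word(s, 2);
}
```

After `demo_frame` runs, `mem[ebp]` holds 4 and `mem[esp]` holds 2. The earlier pushes sit at higher indices than `ebp` and the later ones at lower indices, which is why real code reaches arguments at positive offsets from EBP and locals at negative ones.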

+9  A: 

Let's try to answer in order!

  1. The data section contains anything that you want to be automatically initialized for you by the system before it calls the entry point of your program. You're right, normally global variables end up here. Zero-initialized (ZI) data is generally not included in the executable file, since there's no reason to - a couple of directives to the program loader are all that's needed to generate that space. Once your program starts running, the ZI and data regions are generally interchangeable. Wikipedia has a lot more information.

  2. Variables don't really exist in assembly programming, at least not in the sense they do when you're writing C code. All you have are the decisions you've made about how to lay out your memory. Variables can be on the stack, somewhere else in memory, or live only in registers.

  3. Registers are the internal data storage of the processor. You can, in general, only do operations on values in processor registers. You can load and store their contents to and from memory, which is the basic operation of how your computer works. Here's a quick example. This C code:

    int a = 5;
    int b = 6;
    int *d = (int *)0x12345678; // assume 0x12345678 is a valid memory pointer
    *d = a + b;
    

    Might get translated to some (simplified) assembly along the lines of:

    load  r1, 5
    load  r2, 6
    load  r4, 0x12345678
    add   r3, r1, r2
    store r4, r3
    

    In this case, you can think of the registers as variables, but in general it's not necessary that any one variable always stay in the same register; depending on how complicated your routine is, it may not even be possible. You'll need to push some data onto the stack, pop other data off, and so on. A 'variable' is that logical piece of data, not where it lives in memory or registers, etc.

  4. An array is just a contiguous block of memory - for a local array, you can just decrement the stack pointer appropriately. For a global array, you can declare that block in the data section.

  5. There are a bunch of conventions about registers - check your platform's ABI or calling convention document for details about how to use them correctly. Your assembler documentation might have information as well. Check the ABI article on Wikipedia.

  6. Your assembly program can call the same library functions and make the same system calls any C program could, so you can just call malloc() to get memory from the heap.
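
To make points 4 and 6 concrete from the C side, here is a minimal sketch of the three usual homes for an array. The names (`COUNT`, `global_squares`, `sum_local_array`, `sum_heap_array`) are invented for this example, and the comments describe what a compiler typically does, not something the C language guarantees.

```c
#include <stdlib.h>

#define COUNT 4

/* Global array with initial values: normally emitted into the data section. */
int global_squares[COUNT] = {0, 1, 4, 9};

/* Local array: the compiler typically reserves its space by moving the
   stack pointer down when the function is entered. */
int sum_local_array(void) {
    int local[COUNT];
    int total = 0;
    for (int i = 0; i < COUNT; i++) {
        local[i] = global_squares[i];
        total += local[i];
    }
    return total;
}

/* Heap array: the space is requested from malloc at run time.
   Returns -1 if the allocation fails. */
int sum_heap_array(void) {
    int *heap = malloc(COUNT * sizeof *heap);
    int total = 0;
    if (heap == NULL) {
        return -1;
    }
    for (int i = 0; i < COUNT; i++) {
        heap[i] = global_squares[i];
        total += heap[i];
    }
    free(heap);
    return total;
}
```

In all three cases the array is just a run of adjacent integers; the only difference is who reserved the memory and when.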

Carl Norum
+1 Thanks again Carl. Very helpful! Especially the example on 3 was helpful. I guess when it really comes down to it, assembly has to be different than C, otherwise it actually would be C itself. For #4... I thought the stack was strictly first in first out. However I guess you can read from the stack without pushing/popping then? Also, I found this which seems useful and relevant: http://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames
Cam
Yes you can read from the stack without push/pop: you just need the stack pointer, and the offset. In x86 asm the stack pointer is contained in the esp register, so you can use esp+offset (in fact, if you disassemble a C app, you will see this method used to access the 'local' variables within a function call).
slugster
@incrediman, The stack is just in memory along with everything else, so you can access it however you want. It's only a stack in the sense of using it to store execution context for function calls and so on.
Carl Norum
First of all in my original comment I meant last in first out, not fifo (can't edit it now though)! Second of all... Thanks @ slugster and Carl, that makes sense. Also, I was reading the link I just posted, and I now understand what Carl meant in 4. Very cool. How would you go about declaring a global array though? With malloc?
Cam
@incrediman, I'm not a `masm` expert, but something like `myArray word 100 dup(?)` would make an array of 100 words if I googled correctly.
Carl Norum
A register storing a variable is like memory storing a variable, except: a) a register has no memory address (true for all contemporary processors), b) it is faster to access and manipulate than a memory variable, c) it takes fewer instruction bytes (shorter instructions) to address than a memory variable, and d) there is only a very small number of registers in any CPU architecture, somewhere around 2, 8, 16, or 32 for CISC processors and 32 to 128 for RISC processors, whereas there can be billions of memory variables at once.
wallyk
This has really been helpful so thanks again. One thing I'm a bit confused about is esp vs ebp, or Stack Pointer vs Base Pointer... can someone outline the difference?
Cam
A: 

1) Yes 2) Code is for code 3) A register is a hardware storage element that can hold up to one data word or address.

Rob
+4  A: 

I'd like to add to this. Programs on a computer are typically split up into three sections, although there are others.

Code Segment - .code, .text : http://en.wikipedia.org/wiki/Code_segment

In computing, a code segment, also known as a text segment or simply as text, is a phrase used to refer to a portion of memory or of an object file that contains executable instructions. It has a fixed size and is usually read-only. If the text section is not read-only, then the particular architecture allows self-modifying code. Read-only code is reentrant if it can be executed by more than one process at the same time. As a memory region, a code segment resides in the lower parts of memory or at its very bottom, in order to prevent heap and stack overflows from overwriting it.

Data Segment - .data : http://en.wikipedia.org/wiki/Data_segment

A data segment is one of the sections of a program in an object file or in memory, which contains the global variables and static variables that are initialized by the programmer. It has a fixed size, since all of the data in this section is set by the programmer before the program is loaded. However, it is not read-only, since the values of the variables can be altered at runtime. This is in contrast to the Rodata (constant, read-only data) section, as well as the code segment (also known as text segment).

BSS : http://en.wikipedia.org/wiki/.bss

In computer programming, .bss or bss (which originally stood for Block Started by Symbol) is used by many compilers and linkers as the name of a part of the data segment containing static variables and global variables that are filled solely with zero-valued data initially (i. e., when execution begins). It is often referred to as the "bss section" or "bss segment". The program loader initializes the memory allocated for the bss section when it loads the program.
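
As a rough illustration of the three sections described above, this C sketch notes where a typical compiler and linker would place each object. The names (`initialized_counter`, `zeroed_buffer`, `greeting`, `bump_counter`) are invented for the example, and the exact placement is an assumption that depends on the toolchain and platform.

```c
/* Has a non-zero initial value: typically stored in .data. */
int initialized_counter = 42;

/* No initializer, so it starts as all zeros: typically placed in .bss,
   which takes up no space in the executable file itself. */
int zeroed_buffer[256];

/* Constant data: typically placed in a read-only data section. */
const char greeting[] = "hello";

/* Executable instructions: placed in the code (.text) section. */
int bump_counter(void) {
    initialized_counter += 1;
    return initialized_counter;
}
```

On many toolchains you can check the placement with a symbol-listing tool such as `nm` or by looking at the linker map file.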

Registers are, as described by others, facilities of the CPU to store data or a memory address. Operations are performed upon registers, such as add eax, ebx, and depending on the assembly dialect, that means different things. In this case (Intel syntax, as used by NASM), it means: add the contents of ebx to eax and store the result in eax. The equivalent in GNU AS (AT&T syntax) is: addl %ebx, %eax. Different dialects of assembly have different rules and operators. I'm not a fan of MASM for this reason - it is very different from NASM, YASM and GNU AS.

There isn't really one general way that assembly interacts with C. ABIs designate how this happens; for example, on 32-bit x86 (Unix) you'll find a function's arguments pushed onto the stack, whereas on x86-64 Unix the first few arguments are passed in registers. Both ABIs expect the result of the function to be stored in the eax/rax register.

Here's a 32-bit add routine that assembles for both Windows and Linux.

_Add:
    push    ebp             ; create stack frame
    mov     ebp, esp
    mov     eax, [ebp+8]    ; grab the first argument
    mov     ecx, [ebp+12]   ; grab the second argument
    add     eax, ecx        ; sum the arguments
    pop     ebp             ; restore the base pointer
    ret

Here, you can see what I mean. The "return" value is found in eax. By contrast, the x64 version (using the Unix calling convention, where the first two integer arguments arrive in edi and esi) would look like this:

_Add:
    push    rbp             ; create stack frame
    mov     rbp, rsp
    mov     eax, edi        ; grab the first argument
    mov     ecx, esi        ; grab the second argument
    add     eax, ecx        ; sum the arguments
    pop     rbp             ; restore the base pointer
    ret

There are documents that define this sort of thing. Here's the UNIX x64 ABI: http://www.x86-64.org/documentation/abi-0.99.pdf. I'm sure you could probably find ABIs for any processor, platform etc you needed.
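
For reference, both routines above compute roughly what the following C function does. The name `add_ints` is hypothetical. The point is that the C source is identical on both platforms, and only the ABI decides whether `a` and `b` arrive on the stack or in registers.

```c
/* C-level view of the _Add routine: two int arguments, one int result.
   The compiler, following the platform ABI, decides where the arguments
   live on entry and returns the result in eax. */
int add_ints(int a, int b) {
    return a + b;
}
```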

How do you operate on an array in assembly? Pointer arithmetic. Given a base address in eax, the next stored integer would be at [eax+4] if the integer is 4 bytes in size. You could create this space using calls to malloc/calloc, or by calling the memory allocation system call, whatever that is on your system.
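
The address calculation can be spelled out by hand in C. This is only a sketch: `load_element` is a made-up helper that does explicitly what the compiler normally does for `base[index]`, assuming 4-byte integers as in the [eax+4] example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads element `index` of an array of 4-byte integers by computing the
   byte address base + index * 4 explicitly, the same calculation an
   addressing mode like [eax + 4] performs for index 1. */
int32_t load_element(const void *base, size_t index) {
    const unsigned char *addr =
        (const unsigned char *)base + index * sizeof(int32_t);
    int32_t value;
    memcpy(&value, addr, sizeof value);  /* load 4 bytes from that address */
    return value;
}
```

With an array holding 10, 20, 30, `load_element(values, 1)` reads the 4 bytes starting at offset 4 and returns 20.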

What is the 'heap'? According to Wikipedia again, it's the area of memory reserved for dynamic memory allocation. You don't see it in your assembly program until you call calloc, malloc or the memory allocation system call, but it is there.

Sorry for the essay.

Ninefingers
+1, very helpful, thanks.
Cam
+1 for linking the ABI.
Mustapha Isyaku-Rabiu