I'd like the compiler to output a file containing pointers to all global variables in the source code it is compiling, along with their sizes.

Is this possible? Is there a way to do it in any C compiler?

+3  A: 

Something like a map file? That will show where the globals and statics are allocated, but not what they point at. Most toolchains (strictly speaking, their linkers) will output one automatically or with a simple command-line option. Just search for "map file" in your documentation.

Michael Dorgan
+1  A: 

While no compiler is required to output this data, most linkers can dump it. For example, the map file produced by Microsoft's linker lists all the public symbols in an executable/DLL, along with each symbol's address relative to the section (read-only, read-write, code, zero-initialized, etc.) it is placed in. Sizes can be derived from the gaps between consecutive addresses, although the result is only an approximation because any padding between symbols is counted too.

You can also probably figure out a way to inspect the debugging symbols generated for the executable, as that's exactly what a debugger has to do anyway.

MSN
+1  A: 

Normally you'd get this from the linker, not the compiler -- the linker is what assigns addresses to things. Most linkers can produce a map file that will contain the addresses of global variables and functions (as well as any other symbols in the executable it creates). It'll be up to you to sort out which are which. Every map file I've seen includes something to tell you, but the exact format varies with the linker involved.

Jerry Coffin
+1  A: 

This information is available in the symbol table of the binary, though it might not mean what you expect it to.

The compiler takes one or more source files, compiles the code to object code, and generates an object file (.o on Unix, .obj on Windows). All global variables and functions defined or referenced in the source file are listed in the symbol table. Symbols that are defined in the source file have specific offsets and sizes, while symbols not defined in the source file are marked as undefined and must be resolved later by the linker. All defined symbols are listed relative to a particular section. Common sections are ".text" for executable code, ".bss" for variables that are initialized to zero when the program starts, and ".data" for variables initialized with non-zero values.

The linker takes one or more object files, combines the sections (putting all of code and data from each object file into one big section for code and data), and writes an output file. This output file may be an executable, or it may be a shared library. An executable on disk still doesn't have a pointer for each variable; it still stores the offset from the beginning of the section to the variable.

When an executable is run, the operating system's loader reads the executable, finds each section, and allocates memory for that section. (It may also set up different permissions on each section -- the ".text" section is often marked as read-only, and, on processors that support it, data sections are sometimes marked as non-executable.) Only then does a variable get a pointer -- when the code needs to access a particular variable, it adds the address of the beginning of the section to the variable's offset from the beginning of the section to get the pointer.

You can use various tools to investigate each binary's symbol table. The GNU toolchain's objdump (used on Linux) is one such tool.

For a simple C hello-world program:

#include <stdio.h>

const char message[] = "Hello world!\n";

int main(int argc, char ** argv) {
        printf(message);
        return 0;
}

I compile (but don't link) it on my Linux box:

$ gcc -c hello.c -o hello.o

Now I can look at the symbol table:

$ objdump -t hello.o
hello.o:     file format elf32-i386

SYMBOL TABLE:
00000000 l    df *ABS*  00000000 hello.c
00000000 l    d  .text  00000000 .text
00000000 l    d  .data  00000000 .data
00000000 l    d  .bss   00000000 .bss
00000000 l    d  .rodata        00000000 .rodata
00000000 l    d  .note.GNU-stack        00000000 .note.GNU-stack
00000000 l    d  .comment       00000000 .comment
00000000 g     O .rodata        0000000e message
00000000 g     F .text  0000002b main
00000000         *UND*  00000000 puts

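The first column is the offset of each symbol from the beginning of its section. Each symbol has various flags, and some of the symbols are used as hints to the rest of the toolchain and the debugger. (If I built with debugging symbols, I'd see many entries devoted to them as well.) The puts entry is marked *UND* because it is defined in the C library rather than in hello.c; it shows up in place of printf presumably because gcc substituted the simpler call. My simple program has only one variable: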

00000000 g     O .rodata        0000000e message

The fifth column tells me the symbol message is size 0xe -- 14 bytes.

Commodore Jaeger