Disassembling printf doesn't give much info:

(gdb) disas printf
Dump of assembler code for function printf:
0x00401b38 <printf+0>:  jmp    *0x405130
0x00401b3e <printf+6>:  nop
0x00401b3f <printf+7>:  nop
End of assembler dump.


(gdb) disas 0x405130
Dump of assembler code for function _imp__printf:
0x00405130 <_imp__printf+0>:    je     0x405184 <_imp__vfprintf+76>
0x00405132 <_imp__printf+2>:    add    %al,(%eax)

How is it implemented under the hood?

Why doesn't disassembling help?

What does * mean before 0x405130?

+8  A: 

Here's one particular implementation, http://ftp.fr.openbsd.org/pub/OpenBSD/src/lib/libc/stdio/printf.c and http://ftp.fr.openbsd.org/pub/OpenBSD/src/lib/libc/stdio/vfprintf.c

nos
And here's how glibc does it: http://sourceware.org/git/?p=glibc.git;a=blob;f=stdio-common/vfprintf.c;h=6e0e85cd7cca9f4dfc9e86fb702db131ab2e1639;hb=refs/heads/master#l195
susmits
A: 

printf() is most likely located in a dynamic shared library. The dynamic linker fills a table with the addresses of the imported functions; that's why you have to make that indirect call.

I don't really recall how this works; it's probable that optimizations complicate the process. But you get the idea.

Bastien Léonard
+1  A: 

I'd say disassembling works just fine here, and that printf is implemented 'under the hood' here using vfprintf, which is pretty much what you'd expect. Note that assembler is typically much more verbose than the C, and time consuming to make sense of where you don't have the annotated source. Compiler output is not a great way of teaching yourself assembler either.

Shane MacLaughlin
+2  A: 

Virtually all C compilers provide the source to their runtime libraries, not just open source compilers. Unfortunately, the source is often written in a form that is rather difficult to follow, and it doesn't generally come with design rationale documents.

So, a very nice resource for dealing with that problem is P.J. Plauger's "The Standard C Library", which provides not only the source for a library implementation but also has details on how it's designed and the special situations that such a library might have to consider.

At the prices that some of the 'used' versions of the book are being offered, it's a steal and should be on any serious C programmer's bookshelf.

Plauger has similar books targeting the C++ library that I think have similar value.

Michael Burr
+1  A: 

As for

What does * mean before 0x405130?

I'm not familiar with gdb's disassembler, but it looks like the jmp *0x405130 is an indirect jump through a pointer. Instead of disassembling what's at 0x405130 you should dump the 4 bytes of memory there. I'd be willing to bet that you'll find another address there, and if you disassemble that location you'll find printf()'s code (how readable that disassembly might be is another story).

In other words, _imp__printf is a pointer to printf(), not printf() itself.


Edit from after more information in the comments below:

A little poking around indicates that jmp *0x405130 is the GAS/AT&T assembly syntax for the instruction that Intel assembly syntax writes as jmp [0x405130].

What makes this curious is that you say the gdb command x/xw 0x405130 shows that the address contains 0x00005274 (which seems to match up with what you got when you disassembled 0x405130). However, that would mean that jmp [0x405130] would try to jump to address 0x00005274, which doesn't seem right (and gdb said as much when you tried to disassemble that address).

It's possible that the _imp__printf entry is using some sort of lazy binding technique where the first time execution jumps through 0x405130, it hits the 0x00005274 address, which causes the OS to field a trap and fix up the dynamic link. After the fixup, the OS will restart execution with the correct link address in 0x405130. But this is sheer guesswork on my part. I have no idea if the system you're using does anything like this (indeed, I don't even know what system you're running on), but it's technically possible. If something like this is going on, you won't see the correct address in 0x405130 until after the first call to printf() has been made.

I think you'll need to single step through a call to printf() at the assembly level to see what's really going on.


Updated information with a GDB session:

Here's the problem you're running into - you're looking at the process before the system has loaded DLLs and fixed up the linkages to the DLLs. Here's a debugging session of a simple "hello world" program compiled with MinGW debugged with GDB:

C:\temp>\mingw\bin\gdb test.exe
GNU gdb (GDB) 7.1
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "mingw32".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from C:\temp/test.exe...done.

(gdb) disas main
Dump of assembler code for function main:
   0x004012f0 <+0>:     push   %ebp
   0x004012f1 <+1>:     mov    %esp,%ebp
   0x004012f3 <+3>:     sub    $0x8,%esp
   0x004012f6 <+6>:     and    $0xfffffff0,%esp
   0x004012f9 <+9>:     mov    $0x0,%eax
   0x004012fe <+14>:    add    $0xf,%eax
   0x00401301 <+17>:    add    $0xf,%eax
   0x00401304 <+20>:    shr    $0x4,%eax
   0x00401307 <+23>:    shl    $0x4,%eax
   0x0040130a <+26>:    mov    %eax,-0x4(%ebp)
   0x0040130d <+29>:    mov    -0x4(%ebp),%eax
   0x00401310 <+32>:    call   0x401850 <_alloca>
   0x00401315 <+37>:    call   0x4013d0 <__main>
   0x0040131a <+42>:    movl   $0x403000,(%esp)
   0x00401321 <+49>:    call   0x4018b0 <printf>
   0x00401326 <+54>:    mov    $0x0,%eax
   0x0040132b <+59>:    leave
   0x0040132c <+60>:    ret
End of assembler dump.

Note that disassembling printf() leads to a similar indirect jump:

(gdb) disas printf
Dump of assembler code for function printf:
   0x004018b0 <+0>:     jmp    *0x4050f8     ; <<-- indirect jump
   0x004018b6 <+6>:     nop
   0x004018b7 <+7>:     nop
End of assembler dump.

And that the _imp__printf symbol makes no sense as code...

(gdb) disas 0x4050f8
Dump of assembler code for function _imp__printf:
   0x004050f8 <+0>:     clc                 ; <<-- how can this be printf()?
   0x004050f9 <+1>:     push   %ecx
   0x004050fa <+2>:     add    %al,(%eax)
End of assembler dump.

or as a pointer...

(gdb) x/xw 0x4050f8
0x4050f8 <_imp__printf>:        0x000051f8  ; <<-- 0x000051f8 is an invalid pointer

Now, let's set a breakpoint at main(), and run to it:

(gdb) break main
Breakpoint 1 at 0x40131a: file c:/temp/test.c, line 5.

(gdb) run
Starting program: C:\temp/test.exe
[New Thread 11204.0x2bc8]
Error while mapping shared library sections:
C:\WINDOWS\SysWOW64\ntdll32.dll: No such file or directory.

Breakpoint 1, main () at c:/temp/test.c:5
5           printf( "hello world\n");

printf() looks the same:

(gdb) disas printf
Dump of assembler code for function printf:
   0x004018b0 <+0>:     jmp    *0x4050f8
   0x004018b6 <+6>:     nop
   0x004018b7 <+7>:     nop
End of assembler dump.

but _imp__printf looks different - the dynamic link has now been fixed up:

(gdb) x/xw 0x4050f8
0x4050f8 <_imp__printf>:        0x77bd27c2

And if we disassemble what _imp__printf is now pointing to, it might not be very readable, but clearly it's code now. This is printf() as implemented in MSVCRT.DLL:

(gdb) disas _imp__printf
Dump of assembler code for function printf:
   0x77bd27c2 <+0>:     push   $0x10
   0x77bd27c4 <+2>:     push   $0x77ba4770
   0x77bd27c9 <+7>:     call   0x77bc84c4 <strerror+554>
   0x77bd27ce <+12>:    mov    $0x77bf1cc8,%esi
   0x77bd27d3 <+17>:    push   %esi
   0x77bd27d4 <+18>:    push   $0x1
   0x77bd27d6 <+20>:    call   0x77bcca49 <msvcrt!_lock+4816>
   0x77bd27db <+25>:    pop    %ecx
   0x77bd27dc <+26>:    pop    %ecx
   0x77bd27dd <+27>:    andl   $0x0,-0x4(%ebp)
   0x77bd27e1 <+31>:    push   %esi
   0x77bd27e2 <+32>:    call   0x77bd400d <wscanf+3544>
   0x77bd27e7 <+37>:    mov    %eax,-0x1c(%ebp)
   0x77bd27ea <+40>:    lea    0xc(%ebp),%eax
   0x77bd27ed <+43>:    push   %eax
   0x77bd27ee <+44>:    pushl  0x8(%ebp)
   0x77bd27f1 <+47>:    push   %esi
   0x77bd27f2 <+48>:    call   0x77bd3330 <wscanf+251>
   0x77bd27f7 <+53>:    mov    %eax,-0x20(%ebp)
   0x77bd27fa <+56>:    push   %esi
   0x77bd27fb <+57>:    pushl  -0x1c(%ebp)
   0x77bd27fe <+60>:    call   0x77bd4099 <wscanf+3684>
   0x77bd2803 <+65>:    add    $0x18,%esp
   0x77bd2806 <+68>:    orl    $0xffffffff,-0x4(%ebp)
   0x77bd280a <+72>:    call   0x77bd281d <printf+91>
   0x77bd280f <+77>:    mov    -0x20(%ebp),%eax
   0x77bd2812 <+80>:    call   0x77bc84ff <strerror+613>
   0x77bd2817 <+85>:    ret
   0x77bd2818 <+86>:    mov    $0x77bf1cc8,%esi
   0x77bd281d <+91>:    push   %esi
   0x77bd281e <+92>:    push   $0x1
   0x77bd2820 <+94>:    call   0x77bccab0 <msvcrt!_lock+4919>
   0x77bd2825 <+99>:    pop    %ecx
   0x77bd2826 <+100>:   pop    %ecx
   0x77bd2827 <+101>:   ret
   0x77bd2828 <+102>:   int3
   0x77bd2829 <+103>:   int3
   0x77bd282a <+104>:   int3
   0x77bd282b <+105>:   int3
   0x77bd282c <+106>:   int3
End of assembler dump.

It's probably harder to read than you might hope because I'm not sure if proper symbols are available for it (or whether GDB can properly read those symbols).

However, as I mentioned in another answer, you can typically get the source for C runtime routines with your compiler, whether open source or not. MinGW doesn't come with the source for MSVCRT.DLL since that's a Windows thing, but you can get the source for it (or something pretty close to it) in a Visual Studio distribution - I think that even the free VC++ Express comes with runtime source (but I might be wrong about that).

Michael Burr
How to do what you said with gdb? -- "Instead of disassembling what's at 0x405130 you should dump the 4 bytes of memory there"
Mask
@Mask: I have little experience with GDB (especially at the command line), but I think that something like "`x/xw 0x405130`" will display the contents of that memory location. In a Windows debugger, I'd type "`dd 0x405130`" if that's any help.
Michael Burr
I tried, but only got weird things:
`(gdb) x/xw 0x405130`
`0x405130 <_imp__printf>: 0x00005274`
`(gdb) disas 0x00005274`
`No function contains specified address.`
`(gdb) x 0x00005274`
`0x5274: Cannot access memory at address 0x5274`
Mask
@Mask: sorry - looks like my guess might be wrong. However, you should be able to see exactly what's going on by single stepping in GDB at the assembly level - that may be a better approach to find the 'meat' of the `printf()` function.
Michael Burr
@Mask: I've updated the answer with a bit more guesswork...
Michael Burr
If that instruction fetches the value stored at the address ([addr]), why does the output of disas addr make sense?
Mask
@Mask: the disassembly doesn't make much sense - the 1st instruction (bytes 0x74 0x52), `je 0x405184 <_imp__vfprintf+76>`, will conditionally jump based on a flag that hasn't been set one way or another, and the 2nd (bytes 0x00 0x00), `add %al,(%eax)`, is the instruction for the opcode consisting of zero bytes, which is nearly always an indication that garbage is being disassembled (and what is `EAX` pointing to anyway? It hasn't been set up to point at anything).
Michael Burr
If it's invalid, what are `_imp__printf+0` and `_imp__printf+2` there for?
Mask
Michael Burr
I'm on Windows, using MinGW and MSYS. BTW, where do the symbolic names come from? The symbol table produced by the `-g` option?
Mask
@Mask: yes the symbol table comes from the `-g` option. I've updated the answer with a lot more detail since I've been able to recreate your scenario.
Michael Burr
What's the complete expanded form of commands like `x/wx` and `x/bx`? I know `w`/`b` mean `word`/`byte`, but what about `x`?
Mask
@Mask: Unfortunately you're asking the wrong guy for details about GDB - what you see in my answer above is about the extent of my expertise... but to answer your specific question about the `x` command: the command means 'eXamine memory', the `x` in the display format option (after the `/`) means 'heXadecimal'. More details here: http://www.delorie.com/gnu/docs/gdb/gdb_56.html
Michael Burr
Very impressive!
Mask
+2  A: 

The * is AT&T assembler syntax for indirect memory reference. I.e.

jmp *<addr>

means "jump to the address stored in <addr>".

It is equivalent to the following Intel syntax:

jmp [addr]

Branch addressing using registers or memory operands must be prefixed by a '*'

Source

Alex
If that instruction fetches the value stored at the address (`[addr]`), why does the output of `disas addr` make sense?
Mask
What makes you think it makes sense?
Alex
The output of `disas addr` is valid (it's in my post).
Mask
It doesn't look valid to me. Have you tried disas'ing the address it points to?
Alex
If it's invalid, what are `_imp__printf+0` and `_imp__printf+2` there for?
Mask