I tried running Kinopiko's example, except I increased the number of iterations by a factor of 100.
test.c:
#include <stdio.h>
int main ()
{
int i;
for (i = 0; i < 1000000; i++) {
printf ("%d\n", i);
}
return 0;
}
Then I took 10 stackshots (under VC, but you can use pstack). Here are the stackshots:
9 copies of this stack:
NTDLL! 7c90e514()
KERNEL32! 7c81cbfe()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
1 copy of this stack:
KERNEL32! 7c81cb96()
KERNEL32! 7c81cc75()
KERNEL32! 7c81cc89()
_write() line 168 + 57 bytes
_flush() line 162 + 23 bytes
_ftbuf() line 171 + 9 bytes
printf() line 62 + 14 bytes
main() line 7 + 14 bytes
mainCRTStartup() line 206 + 25 bytes
KERNEL32! 7c817077()
In case it isn't obvious, this tells you that:
mainCRTStartup() line 206 + 25 bytes Cost ~100% of the time
main() line 7 + 14 bytes Cost ~100% of the time
printf() line 62 + 14 bytes Cost ~100% of the time
_ftbuf() line 171 + 9 bytes Cost ~100% of the time
_flush() line 162 + 23 bytes Cost ~100% of the time
_write() line 168 + 57 bytes Cost ~100% of the time
In a nutshell, the program spends ~100% of it's time flushing to disk (or console) the output buffer as part of the printf on line 7.
(What I mean by "Cost of a line" is - it is the fraction of total time spent at the request of that line, and that's roughly the fraction of samples that contain it.
If that line could be made to take no time, such as by removing it, skipping over it, or passing its work off to an infinitely fast coprocessor, that time fraction is how much the total time would shrink. So if the execution of any of these lines of code could be avoided, time would shrink by somewhere in the range of 95% to 100%. If you were to ask "What about recursion?", the answer is It Makes No Difference.)
Now, maybe you want to know something else, like how much time is spent in the loop, for example. To find that out, remove the printf because it's hogging all the time. Maybe you want to know what % of time is spent purely in CPU time, not in system calls. To get that, just throw away any stackshots that don't end in your code.
The point I'm trying to make is if you're looking for things you can fix to make the code run faster, the data gprof gives you, even if you understand it, is almost useless. By comparison, if there is some of your code that is causing more wall-clock time to be spent than you would like, stackshots will pinpoint it.