What other programs do the same thing as gprof?
Try OProfile. It is a much better tool for profiling your code. I would also suggest Intel VTune.
The two tools above can narrow down the time spent to a particular line of code, annotate your code, show the assembly, and show how much time a particular instruction takes. Besides the time metric, you can also query specific counters, e.g. cache hits, etc.
Unlike gprof, you can profile any process/binary running on your system using either of the two.
gprof (read the paper) exists for historical reasons.
If you think it will help you find performance problems, note that it was never advertised as such. It is only a measurement tool, and only of CPU-bound operations.
Try this instead.
Here's an example.
There's a simple observation about programs. In a given execution, every instruction is responsible for some fraction of the overall time, in the sense that if it were not there, the time would not be spent. During that time, the instruction is on the stack **. When that is understood, you can see that -
gprof embodies certain myths about performance, such as:
that program counter sampling is useful.
It is only useful if you have an unnecessary hotspot bottleneck such as a bubble sort of a big array of scalar values. As soon as you, for example, change it into a sort using string-compare, it is still a bottleneck, but program counter sampling will not see it, because now the hotspot is in string-compare. On the other hand, if it were to sample the extended program counter (the call stack), the point at which the string-compare is called, the sort loop, is clearly displayed.

that timing functions is more important than capturing time-consuming lines of code.
The reason for that myth is that gprof was not able to capture stack samples, so instead it used "instrumentation", which times functions, counts their invocations, and tries to capture the call graph. However, once a costly function is identified, you still need to look inside it for the lines that are responsible for the time. If there were stack samples, you would not need to look; those lines would be on the samples.

that the call graph is important.
What you need to know about a program is not where it spends its time, but why. When it is spending time in a function, every line of code on the stack gives one link in the chain of reasoning of why it is there. If you can only see part of the stack, you can only see part of the reason why, so you can't tell for sure if that time is actually necessary. What does the call graph tell you? Each arc tells you that some function A was in the process of calling some function B for some fraction of the time. Even if A has only one such line of code calling B, that line only gives a small part of the reason why. If you are lucky enough, maybe that line has a poor reason. Usually, you need to see multiple simultaneous lines to find a poor reason if it is there. If A calls B in more than one place, then it tells you even less.

that recursion is a tricky, confusing issue.
That is only because gprof and other profilers perceive a need to generate a call graph and then attribute times to the nodes. If one has samples of the stack, the time-cost of each line of code that appears on samples is a very simple number: the fraction of samples it is on. If there is recursion, then a given line can appear more than once on a sample. No matter: the time slice represented by the sample is being spent because the line is on it, regardless of how many times it is on it. If the line could be made to take no time (such as by deleting it or branching around it), then the fraction of time it had been on the stack would no longer be spent.

that accuracy of time measurement (and therefore a large number of samples) is important.
Think about it for a second. If a line of code is on 3 samples out of 5, then if you could shoot it out like a light bulb, roughly 60% less time would be used. Now, you know that if you had taken a different 5 samples, you might have seen it only 2 times, or as many as 4. So that 60% measurement is more like a general range from 40% to 80%. If it were only 40%, would you say the problem is not worth fixing? So what's the point of time accuracy, when what you really want is to find the problems?

that counting of statement or function invocations is useful.
Suppose you know a function has been called 1000 times. Can you tell from that what fraction of time it costs? You also need to know how long it takes to run, on average, multiply it by the count, and divide by the total time. The average invocation time could vary from nanoseconds to seconds, so the count alone doesn't tell much. If there are stack samples, the cost of a routine or of any statement is just the fraction of samples it is on. That fraction of time is what could in principle be saved overall if the routine or statement could be made to take no time, so that is what has the most direct relationship to performance.

that samples need not be taken when blocked
The reasons for this myth are twofold: 1) that PC sampling is meaningless when the program is waiting, and 2) the preoccupation with accuracy of timing. However, for (1), the program may very well be waiting for something that it asked for, such as file I/O, which you need to know about, and which stack samples reveal. (Obviously you want to exclude samples taken while waiting for user input.) For (2), if the program is waiting simply because of competition with other processes, that presumably happens in a fairly random way while it's running. So while the program may be taking longer, that will not have a large effect on the statistic that matters: the percentage of time that statements are on the stack.

that "self time" matters
Self time only makes sense if you are measuring at the function level, not the line level, and you think you need help in discerning whether the function time goes into purely local computation or into called routines. If summarizing at the line level, a line represents self time if it is at the end of the stack; otherwise it represents inclusive time. Either way, what it costs is the percentage of stack samples it is on, so that locates it for you in either case.

that samples have to be taken at high frequency
This comes from the idea that a performance problem may be fast-acting, and that samples have to be frequent in order to hit it. But if the problem is costing, say, 20% out of a total running time of 10 sec (or whatever), then each sample in that total time will have a 20% chance of hitting it, no matter whether the problem occurs in a single piece like this
.....XXXXXXXX...........................
.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^
(20 samples, 4 hits)
or in many small pieces like this
X...X...X.X..X.........X.....X....X.....
.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^
(20 samples, 3 hits)
Either way, the number of hits will average about 1 in 5, no matter how many samples are taken, or how few. (Average = 20 * 0.2 = 4. Standard deviation = +/- sqrt(20 * 0.2 * 0.8) = 1.8.)

that you are trying to find the bottleneck
as if there were only one. What you are looking for is activities that take a significant percentage of time and that you can replace with something faster, and there can be several of those. They are like acrobats forming a human tower, with the biggest guy on the bottom, up to the smallest on top. The biggest guy has the largest percentage of the weight, so that's the one to remove. Once you've done that, the next biggest guy has the largest percentage of the weight, so you can go after him next. That's the "magnification effect". Each problem you remove reduces the overall time and makes the others take a larger percentage of what remains. (Example: Fix a 20% problem. Time shrinks to 4/5 of what it was. Percentages of remaining problems expand by 5/4 to get back to 100%.) Since you're always removing, if not the biggest one, then one of the big ones, that makes the remainder bigger, percentage-wise, so you can keep going until you can't anymore, and then you've achieved massive speedup factors.
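The parenthetical example can be checked with a little arithmetic. This is just a sketch of the example's own numbers (a 10 sec run and 20% / 10% problems); nothing here is specific to any profiler:

```python
# Arithmetic sketch of the "magnification effect": removing a problem
# that costs a fraction of the total time shrinks the total, so the
# remaining problems grow as a percentage of what is left.

total = 10.0                      # seconds of original running time

# Fix a problem costing 20% of the total:
new_total = total * (1 - 0.20)    # time shrinks to 4/5 of what it was (8 sec)

# A remaining problem that cost 10% of the ORIGINAL time (1 sec)
# now takes a larger percentage of what remains:
remaining_cost = 0.10 * total             # still 1.0 sec in absolute terms
share_before = remaining_cost / total     # 10%
share_after = remaining_cost / new_total  # 12.5% -- expanded by 5/4

# Removing big problems one after another compounds:
t = total
for fraction_of_current in (0.20, 0.20, 0.20):
    t *= 1 - fraction_of_current
speedup = total / t               # about 1.95x after three 20% fixes
```

The compounding is the point: each fix alone looks modest, but three 20% fixes in a row nearly halve the running time.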
** With the exception of other ways of requesting work to be done, that don't leave a trace telling why, such as by message posting.
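The accounting described above (a line's cost is the fraction of samples it appears on, counted once per sample even under recursion, with "self time" just meaning the line was at the top of the sample) can be sketched in a few lines. The samples and file:line names below are made up for illustration, not the output of any real profiler:

```python
# Sketch of stack-sample accounting with hand-made samples.
# Each sample is the call stack at one random moment, outermost line first.
# A recursive line may appear several times in one sample; it is counted
# once, because the sample's time slice is spent as long as it is on the
# stack at all, no matter how many times.

from collections import Counter

samples = [
    ["main.c:10", "sort.c:40", "compare.c:7"],
    ["main.c:10", "sort.c:40", "compare.c:7"],
    ["main.c:10", "walk.c:22", "walk.c:22", "walk.c:22"],  # recursion
    ["main.c:10", "sort.c:40"],
    ["main.c:10", "io.c:55"],
]

inclusive = Counter()   # samples the line is on anywhere in the stack
self_time = Counter()   # samples the line is on top of the stack

for stack in samples:
    for line in set(stack):        # set(): recursion counted once per sample
        inclusive[line] += 1
    self_time[stack[-1]] += 1

n = len(samples)
for line in sorted(inclusive):
    print(f"{line}: inclusive {inclusive[line]/n:.0%}, self {self_time[line]/n:.0%}")
```

With these made-up samples, sort.c:40 shows 60% inclusive but only 20% self, which is exactly the case PC sampling misattributes, and the recursive walk.c:22 is charged for 20% regardless of how deep it recursed.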
valgrind has an instruction-count profiler with a very nice visualizer called kcachegrind. As Mike Dunlavey recommends, valgrind counts the fraction of instructions for which a procedure is live on the stack, although I'm sorry to say it appears to become confused in the presence of mutual recursion. But the visualizer is very nice and light-years ahead of gprof.
Take a look at Sysprof.
Your distribution may have it already.