I have a C++ application I'm in the process of optimizing. What tool can I use to pinpoint my slow code? :)
You may have a look at gprof, the GNU profiler.
Another interesting tool is IBM Rational Quantify, but it's not free.
I assume you're using GCC. The standard solution would be to profile with gprof.
Be sure to add -pg to compilation before profiling:
cc -o myprog myprog.c utils.c -g -pg
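Once the instrumented binary has been run (it writes a gmon.out file in the working directory on exit), the report is produced by pointing gprof at the binary and that file, roughly like this:

./myprog
gprof myprog gmon.out > analysis.txt

analysis.txt then contains the flat profile and the call graph.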
I haven't tried it yet but I've heard good things about google-perftools. It is definitely worth a try.
Related question here.
A few other buzzwords if gprof does not do the job for you: Valgrind, Intel VTune, Sun DTrace.
You can use callgrind; together with KCacheGrind it gives a pretty nice profiler. Besides that, Intel VTune is free for educational use on Linux, and it is probably the best profiler out there. If you have an AMD CPU, use AMD CodeAnalyst, which is also available for Linux; it is only decent, but free.
oprofile is good because it makes it much easier than gprof to profile multiple programs at once. You can also run it on your release build (if it has symbols), instead of having to build a special profiling build.
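If it helps, a minimal oprofile session using the legacy opcontrol interface (newer setups use operf instead) looks roughly like this, assuming root privileges and a binary named myprog:

opcontrol --no-vmlinux --start
./myprog
opcontrol --shutdown
opreport -l ./myprog

opreport -l gives a per-symbol breakdown for that binary.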
If you don't care about taking a massive performance hit (50x), valgrind is good.
In addition to Intel VTune / AMD CodeAnalyst, perfmon2 is an OSS alternative. It requires a patched kernel to open up the CPU performance counters, and it will give you various performance figures that you can gather. perfmon2 is still implementation specific, i.e. L2 cache misses are called different things on an Intel P3 compared to AMD64, and work has begun on perfmon3, which should unify the API.
But generally, gprof would work well enough for you to detect slow code.
I've had very good luck with Rational Quantify (expensive but very good) and oprofile. Be aware when using oprofile that it's a statistical profiler, not a full-on tracing profiler like Quantify. oprofile uses a kernel module to poke into the call stack of every running process at each sampling interval, so certain behaviors may not be caught. Using multiple profilers is good, especially since different profilers tend to give you different data, all of which is useful.
As for using gprof, it's OK. I would get one of the graphical front-ends, since the data can be rather difficult to get through on the command line alone. I would avoid valgrind until you require memory checking.
OK, downvote time...
If your goal is to use a profiler, use one of the suggested ones.
However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.
Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.
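If you want a concrete way to take those samples on Linux, one option (just a sketch; myprog stands in for your binary) is to attach gdb while the program feels slow:

gdb -p $(pidof myprog)
(gdb) bt
(gdb) continue
... press Ctrl-C after a moment, then repeat bt / continue a few times ...

Each bt is one stack sample; code that shows up in most of them is where the time is going.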
You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes.
Caveat: programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack. Call graphs don't give you the same information, because 1) they don't summarize at the instruction level, and 2) they give confusing summaries in the presence of recursion. They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find.
P.S. This can also be done on multi-threaded programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.
P.P.S. As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that that is the cause of performance problems (and the opportunity to get speedup).
Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.
Another objection I often hear is: "It will stop someplace random, and it will miss the real problem". This comes from having a prior concept of what the real problem is. A key property of performance problems is that they defy expectations. Sampling tells you something is a problem, and your first reaction is disbelief. That is natural, but you can be sure if it finds a problem it is real, and vice-versa.
ADDED: Let me make a Bayesian explanation of how it works. Suppose there is some instruction I (call or otherwise) which is on the call stack some fraction f of the time (and thus costs that much). For simplicity, suppose we don't know what f is, but assume it is either 0.1, 0.2, 0.3, ... 0.9, 1.0, and the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a-priori.
Then suppose we take just 2 stack samples, and we see instruction I on both samples, designated observation o=2/2. This gives us new estimates of the frequency f of I, according to this:
Prior P(f=x)    x    P(o=2/2|f=x)    P(o=2/2&&f=x)    P(o=2/2&&f >= x)    P(f >= x)
0.1 1 1 0.1 0.1 0.25974026
0.1 0.9 0.81 0.081 0.181 0.47012987
0.1 0.8 0.64 0.064 0.245 0.636363636
0.1 0.7 0.49 0.049 0.294 0.763636364
0.1 0.6 0.36 0.036 0.33 0.857142857
0.1 0.5 0.25 0.025 0.355 0.922077922
0.1 0.4 0.16 0.016 0.371 0.963636364
0.1 0.3 0.09 0.009 0.38 0.987012987
0.1 0.2 0.04 0.004 0.384 0.997402597
0.1 0.1 0.01 0.001 0.385 1
P(o=2/2) 0.385
The last column says that, for example, the probability that f >= 0.5 is 92%, up from the prior assumption of 60%.
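Spelled out, that 92% is just Bayes' rule applied to the two totals in the x=0.5 row:

P(f >= 0.5 | o=2/2) = P(o=2/2 && f >= 0.5) / P(o=2/2) = 0.355 / 0.385 ≈ 0.92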
Suppose the prior assumptions are different. Suppose we assume P(f=0.1) is .991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that I is cheap. Then we get:
Prior P(f=x)    x    P(o=2/2|f=x)    P(o=2/2&&f=x)    P(o=2/2&&f >= x)    P(f >= x)
0.001 1 1 0.001 0.001 0.072727273
0.001 0.9 0.81 0.00081 0.00181 0.131636364
0.001 0.8 0.64 0.00064 0.00245 0.178181818
0.001 0.7 0.49 0.00049 0.00294 0.213818182
0.001 0.6 0.36 0.00036 0.0033 0.24
0.001 0.5 0.25 0.00025 0.00355 0.258181818
0.001 0.4 0.16 0.00016 0.00371 0.269818182
0.001 0.3 0.09 0.00009 0.0038 0.276363636
0.001 0.2 0.04 0.00004 0.00384 0.279272727
0.991 0.1 0.01 0.00991 0.01375 1
P(o=2/2) 0.01375
Now it says P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.
(The key is that we see I more than once. If we only see it once, that doesn't tell us much except that f > 0.)
So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost. If n samples are taken, and f is the cost, then I will appear on nf+/-sqrt(nf(1-f)) samples. Example: n=10, f=0.3, that is 3+/-1.4 samples.)
You can use valgrind with the following options:
valgrind --tool=callgrind ./(Your binary)
It will generate a file called callgrind.out.x. You can then use the kcachegrind tool to read this file. It will give you a graphical analysis of things, with results like which lines cost how much.
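The x in callgrind.out.x is the process id of the run, so opening the result looks something like:

kcachegrind callgrind.out.12345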
You can use Sun Studio's collect/analyzer. Using collect you can also profile memory usage, threads, MPI, etc. You also get a nice timeline view of your program.
If you use these tools in Solaris you can also get hardware performance counter information like in vtune or oprofile.
You can get this (and other very useful tools from Sun) at: http://developers.sun.com/sunstudio/downloads/express/index.jsp
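A rough sketch of a collect/analyzer session (test.1.er is the default experiment name collect creates; myprog stands in for your binary):

collect ./myprog
analyzer test.1.er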
I would use Valgrind and Callgrind as a base for my profiling tool suite. What is important to know is that Valgrind is basically a Virtual Machine:
(wikipedia) Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor. Instead, Valgrind first translates the program into a temporary, simpler form called Intermediate Representation (IR), which is a processor-neutral, SSA-based form. After the conversion, a tool (see below) is free to do whatever transformations it would like on the IR, before Valgrind translates the IR back into machine code and lets the host processor run it.
Callgrind is a profiler built upon that. The main benefit is that you don't have to run your application for hours to get a reliable result. Even a one-second run is sufficient to get rock-solid, reliable results, because Callgrind is a non-probing profiler.
Another tool built upon Valgrind is Massif. I use it to profile heap memory usage. It works great. What it does is give you snapshots of memory usage -- detailed information about WHAT holds WHAT percentage of memory, and WHO put it there. Such information is available at different points in time of the application run.
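A typical Massif run, assuming your binary is called myprog, looks like this; ms_print turns the massif.out.<pid> file it writes into a readable report:

valgrind --tool=massif ./myprog
ms_print massif.out.<pid>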
Oprofile is a decent free option, but I've had better luck using Zoom. It's a commercial (free eval) profiler for Linux that has a sweet GUI for seeing hotspots in your source code.
Newer kernels (e.g. the latest Ubuntu kernels) come with the new 'perf' tools (apt-get install linux-tools).
These come with classic sampling profilers (man-page) as well as the awesome timechart!
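A quick sampling run with perf, assuming a binary named myprog, goes along these lines:

perf record -g ./myprog
perf report

and the timechart mentioned above comes from perf timechart record ./myprog followed by perf timechart.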
There is also LTTng (http://lttng.org/). I've never used that one though, so I cannot tell how well it works. But one advantage it has is a userspace tracer. In some situations that could be rather nice to have.