I have a C++ application I'm in the process of optimizing. What tool can I use to pinpoint my slow code? :)

+2  A: 

You may want to have a look at gprof, the GNU profiler.

Another interesting tool is IBM Rational Quantify, but it's not free.

dario minonne
+32  A: 

I assume you're using GCC. The standard solution would be to profile with gprof.

Be sure to add -pg to your compiler flags before profiling:

cc -o myprog myprog.c utils.c -g -pg
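
Run the instrumented binary as usual; it writes gmon.out to the current directory on exit, and gprof turns that into a report (a minimal sketch, reusing the myprog name from the compile line above):

./myprog                            # produces gmon.out on exit
gprof myprog gmon.out > report.txt  # flat profile plus call graph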

I haven't tried it yet, but I've heard good things about google-perftools. It is definitely worth a try.

Related question here.

A few other buzzwords if gprof does not do the job for you: Valgrind, Intel VTune, Sun DTrace.

Nazgob
I agree that gprof is the current standard. Just a note, though: Valgrind is used to profile memory leaks and other memory-related aspects of your programs, not for speed optimization.
Bill the Lizard
Bill, in the Valgrind suite you can find Callgrind and Massif. Both are pretty useful for profiling apps.
dario minonne
@Bill-the-Lizard: Some comments on **gprof** : http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
+19  A: 

You can use callgrind; together with KCacheGrind it makes a pretty nice profiler. Besides that, Intel VTune is free for educational use on Linux, and it's probably the best profiler out there. If you have an AMD CPU, use AMD CodeAnalyst, which is also available for Linux; this one is only decent, but free.

Anteru
I've had success with AMD Codeanalyst even on Intel chipsets. Very nice tool for a freebie :)
Mike
Well, but it sometimes gives very weird results, and it's not too stable... if it works, it's good, but I didn't get it working too often.
Anteru
+3  A: 

What about Valgrind?

Pretty sure you can use Cachegrind or some similar plugin to do profiling.
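
A minimal Cachegrind run might look like this (a sketch; assumes a binary called ./myprog, and the output-file suffix is the PID of the run):

valgrind --tool=cachegrind ./myprog   # simulates the caches, counts instructions
cg_annotate cachegrind.out.<pid>      # per-function counts; add --auto=yes for per-line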

Peter
+23  A: 

oprofile is good because it makes it much easier than gprof to profile multiple programs at once. You also can run it on your release build (if it has symbols), instead of having to build a special profiling build.
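
For the record, a typical session with the classic opcontrol interface looks roughly like this (a sketch only; newer OProfile releases replace opcontrol with operf):

sudo opcontrol --init          # load the oprofile kernel module
sudo opcontrol --no-vmlinux    # don't profile the kernel itself
sudo opcontrol --start
./myprog                       # run the release build (with symbols)
sudo opcontrol --dump
opreport -l ./myprog           # per-symbol breakdown
sudo opcontrol --shutdown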

If you don't care about taking a massive performance hit (50x), valgrind is good.

Greg Rogers
+1, oprofile is great for looking at an entire system, and for profiling code in kernel space
orip
Oprofile also seems to handle functions coded in assembly better than gprof and many other tools do.
Adam K. Johnson
oprofile is always described as a kernel profiler or system profiler. I've never used it as such. I set oprofile to filter on the executable I'm currently working on and ignore the kernel and the rest of the system. It's honestly the best way to find performance problems. As a bonus, OProfile can measure statistics other than raw CPU usage. My personal favorite is L2 cache misses, perfect for finding cache thrashing in threaded code.
caspin
+2  A: 

In addition to Intel VTune / AMD CodeAnalyst, perfmon2 is an OSS alternative. It requires a patched kernel to open up the CPU performance counters, and it gives you various performance figures that you can gather. perfmon2 is still implementation specific, i.e. L2 cache misses are called different things on an Intel P3 compared to AMD64, and they're beginning work on perfmon3, which should unify the API.

But generally, gprof would work well enough for you to detect slow code.

Calyth
+1  A: 

I've had very good luck with Rational Quantify (expensive but very good) and oprofile. Be aware when using oprofile that it's a statistical profiler, not a full-on tracing profiler like Quantify. oprofile uses a kernel module to poke into the call stack of every running process at each interval, so certain behaviors may not be caught. Using multiple profilers is good, especially since different profilers tend to give you different data, all of which is useful.

As for gprof, it's OK. I would get one of the graphical front-ends, since the data can be rather difficult to wade through on the command line alone. I would avoid valgrind until you require memory checking.

+115  A: 

OK, downvote time...

If your goal is to use a profiler, use one of the suggested ones.

However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.

Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.

You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes.

Caveat: programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack. Call graphs don't give you the same information, because 1) they don't summarize at the instruction level, and 2) they give confusing summaries in the presence of recursion. They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find.

P.S. This can also be done on multi-thread programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.

P.P.S. As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that they are the cause of performance problems (and the opportunity to get speedup).

Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.

Another objection I often hear is: "It will stop someplace random, and it will miss the real problem". This comes from having a prior concept of what the real problem is. A key property of performance problems is that they defy expectations. Sampling tells you something is a problem, and your first reaction is disbelief. That is natural, but you can be sure if it finds a problem it is real, and vice-versa.

ADDED: Let me make a Bayesian explanation of how it works. Suppose there is some instruction I (call or otherwise) which is on the call stack some fraction f of the time (and thus costs that much). For simplicity, suppose we don't know what f is, but assume it is either 0.1, 0.2, 0.3, ... 0.9, 1.0, and the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a-priori.

Then suppose we take just 2 stack samples, and we see instruction I on both samples, designated observation o=2/2. This gives us new estimates of the frequency f of I, according to this:

Prior
P(f=x)  x    P(o=2/2|f=x)  P(o=2/2&&f=x)  P(o=2/2&&f>=x)  P(f>=x)

0.1     1    1             0.1            0.1             0.25974026
0.1     0.9  0.81          0.081          0.181           0.47012987
0.1     0.8  0.64          0.064          0.245           0.636363636
0.1     0.7  0.49          0.049          0.294           0.763636364
0.1     0.6  0.36          0.036          0.33            0.857142857
0.1     0.5  0.25          0.025          0.355           0.922077922
0.1     0.4  0.16          0.016          0.371           0.963636364
0.1     0.3  0.09          0.009          0.38            0.987012987
0.1     0.2  0.04          0.004          0.384           0.997402597
0.1     0.1  0.01          0.001          0.385           1

                           P(o=2/2) = 0.385

The last column says that, for example, the probability that f >= 0.5 is 92%, up from the prior assumption of 60%.

Suppose the prior assumptions are different. Suppose we assume P(f=0.1) is .991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that I is cheap. Then we get:

Prior
P(f=x)  x    P(o=2/2|f=x)  P(o=2/2&&f=x)  P(o=2/2&&f>=x)  P(f>=x)

0.001   1    1             0.001          0.001           0.072727273
0.001   0.9  0.81          0.00081        0.00181         0.131636364
0.001   0.8  0.64          0.00064        0.00245         0.178181818
0.001   0.7  0.49          0.00049        0.00294         0.213818182
0.001   0.6  0.36          0.00036        0.0033          0.24
0.001   0.5  0.25          0.00025        0.00355         0.258181818
0.001   0.4  0.16          0.00016        0.00371         0.269818182
0.001   0.3  0.09          0.00009        0.0038          0.276363636
0.001   0.2  0.04          0.00004        0.00384         0.279272727
0.991   0.1  0.01          0.00991        0.01375         1

                           P(o=2/2) = 0.01375

Now it says P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.

(The key is that we see I more than once. If we only see it once, that doesn't tell us much except that f > 0.)

So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost. If n samples are taken, and f is the cost, then I will appear on nf+/-sqrt(nf(1-f)) samples. Example, n=10, f=0.3, that is 3+/-1.4 samples.)
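
If you want to mechanize the manual pauses, something along these lines works (a rough sketch only; it assumes gdb is installed, your program is already running, and $PID holds its process ID):

# take 10 call-stack samples of the running process, one per second
for i in $(seq 1 10); do
    gdb -batch -ex "thread apply all bt" -p $PID
    sleep 1
done > samples.txt

Any call site that appears on a large fraction of the samples is costing you roughly that fraction of the run time.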

Mike Dunlavey
I'm a programmer and I work in an environment where performance is important. I have to say that this is a very good technique (that's why I'm going to upvote you :). But what can you do for recursive programs? Not a lot. That's why I still think Rational Quantify is a good tool.
dario minonne
Thanks. Actually it has no problem with recursion. If a call instruction appears >1 time on a sample, that is still only 1 sample. The time an instruction costs ~= the number of samples it is on.
Mike Dunlavey
This looks like a novel quick and dirty alternative to profiling. Thanks for suggesting it.
Waylon Flinn
@Waylon: Thanks. It is quick. I wish it were not so novel. Nor is it dirty. I consider it, for the purpose of locating problems, much better than typical profilers because it pinpoints problems rather than just giving clues.
Mike Dunlavey
@Mike Maybe after I've given it a try I'll agree on the dirty.
Waylon Flinn
This is basically a poor man's sampling profiler, which is great, but you run the risk of a too-small sample size which will possibly give you entirely spurious results.
Crashworks
@Crash: I won't debate the "poor man" part :-) It's true that statistical measurement precision requires many samples, but there are two conflicting goals - measurement and problem location. I'm focussing on the latter, for which you need precision of location, not precision of measure. So for example, there can be, mid-stack, a single function call A(); that accounts for 50% of time, but it can be in another large function B, along with many other calls to A() that are not costly. Precise summaries of function times can be a clue, but every other stack sample will pinpoint the problem.
Mike Dunlavey
@Crash: These examples may give a flavor of why it works: http://stackoverflow.com/questions/890222/analyzing-code-for-efficiency/893272#893272
Mike Dunlavey
Most modern sampling profilers will give you the call stack as well for each sample. I'm thinking of VTune and xbperfview in particular: they walk the stack at each sample and show you exactly which callers contribute to a function's cost. Certainly a sampler without this feature is much less useful.
Crashworks
@Crash: I just looked at the doc on VTune and xbperfview, and I don't see either one claiming to walk the stack at each sample, but even if they do, I don't see either one claiming to record, at the call instruction level, the fraction of samples containing that call instruction. This is very different from call graphs or showing frequencies at the call-graph level. Now, LukeStackwalker does actually walk the stack at each sample, but then, IT DISCARDS THE FREQUENCIES OF CALL INSTRUCTIONS! All these guys talk glibly about "hotspots" and "bottlenecks" as if these terms are well defined.
Mike Dunlavey
... and all the examples they give are tiny programs with shallow call stacks where the performance problems are indeed "hotspots", which I define as code where the PC is found a large percentage of the time, thus necessarily not containing call instructions. What profilers SHOULD do is give, for each INSTRUCTION on the call stack, ESPECIALLY call instructions, the fraction of samples that contain that instruction. In big software, the biggest performance wasters by far are avoidable function calls, but none of the profilers "get it". Why not???
Mike Dunlavey
... the world seems to think that a call-graph, annotated with call counts and/or average timing, is good enough. It is not. And the sad part is, for those that sample the call stack, the most useful information is right in front of them, but they throw it away, in the interests of "statistics".
Mike Dunlavey
I'm not sure what you're getting at. If I want a sampling profiler showing me inclusive and exclusive costs for each function, with an annotated cost graph, I use xbperfview. If I want instruction-level information, where every single opcode that passes through the CPU gets recorded and I can see which exact PC addresses get hit a lot, I use PIX, but that's so much data I can only collect a frame or two's worth.
Crashworks
Also, the performance issues I tend to look at are no more than 3-4% of the program's entire cost individually. We just don't have hotspots that are bigger chunks than that; instead we have to knock down a bunch of small issues so they can add up to a significant savings collectively.
Crashworks
@Crash: Well, it could be your code is that tight. I can't remember if I drew your attention to this: http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773. The main point is that in good-size software, avoidable function calls dramatically outcost anything the PC does. Functions are not small, and if they call another function, they don't necessarily do so in a small number of places (or even visibly in the source code), so knowing inclusive time is only a hint at best, and exclusive time is mostly zero except at leaves. ...
Mike Dunlavey
... But every stack sample pinpoints the call instructions, and if a call instruction (not the instruction itself, but the time until the function returns) is on the stack some percent of the time (like 10%) then it will show up on that % of samples (on average), and that tells you what could be saved by killing it. This concentration on PC-level optimizing is missing the wide world of avoidable function calls, which are not hot spots. I liken function calls to credit cards. The card itself costs nothing, but what it really costs is it makes it easy to spend too much.
Mike Dunlavey
... Nowadays profilers are into "the call tree". That's a start, because at each point in the call tree it can indicate the exact statement where a function is called. The problem is, that statement can be in many branches of the call tree, so its cost in any one branch is not representative of its true cost.
Mike Dunlavey
... Here is exactly what I would like a tool to tell me. I want it to list the addresses of call instructions, and tell me for each one, the total _inclusive_ time it was on the stack, in the interval of interest, as a percent of that interval. (Not execution count, not average call duration.) The list should be sorted in descending order by that percent, and only the first 100 or so need be shown. I define the instruction's cost as that percent. I know of no current tool that can give me that information (except one I built some years ago).
Mike Dunlavey
... and if you want to bring up recursion, I'll be glad to explain for the umpteenth time why that is not an issue.
Mike Dunlavey
I'm still unclear on what you mean by emphasis on the particular call instruction. My sampling profiler tells me that my program spends, say, 7% of its time in function F(), 60% of which comes from calls in function A() and 40% from calls in function B(). Under F() are G() and H(), which together add up to 6% of frame when called from F() and 1% otherwise. ( See dl.getdropbox.com/u/23059/jpix/xbperf.png ) I can use the trace tool to get instruction-level data, but usually that's unnecessary just to see which functions are being called way too often and which must be tightened.
Crashworks
I don't mean to disagree with your technique. Clearly I rely quite heavily on stack-walking sampling profilers. I'm just pointing out that there are some tools that do it in an automated way now, which is important when you're past the point of getting a function from 25% to 15% and need to knock it down from 1.2% to 0.6%.
Crashworks
@Crash: Thanks for staying with me here. Following your example, function F is on the stack 7% of the time, and calls to F from A are on the stack 4.2% of the time. Now if you're lucky, A only calls F in one site, and that site is visible in source. If A calls F from multiple sites, or if the calls are compiler-inserted, you still have to use brainpower to locate the site(s) accounting for most of that 4.2%, and those sites were available, on the stack. The link I gave you above shows the implications of this. I'd love it if I could collect 1000 stack samples, but I've done OK without.
Mike Dunlavey
... I forgot to mention, generally I want to sample on wall-clock time, not CPU time. If I'm doing too much I/O I need to know it.
Mike Dunlavey
... another case where stack sampling really pulls you outa the frying pan: notification-style programming, where there are layers of data to be kept synchronized, and "change handlers" get added on-the-fly (and then forgotten, and then re-added). Same goes for "properties" where you have no idea how much will happen when you get/set something, especially with virtual classes.
Mike Dunlavey
Mike Spivey has a really nice variation on gprof that handles recursion sensibly. I hate what we have to put up with from the GNU project. Bad tools! Ugh!
Norman Ramsey
What a wonderfully simple method - I wish I'd thought of it by myself. I love it!
Filmund
@Norman: I read Mike Spivey's abstract. It's good that he's handling recursion sensibly in the context of call graphs. The problem is he's still in the mindset of the original gprof, which values 1) functions, not instructions (losing state information), 2) accuracy of timing, not of problem location (necessitating efficiency), 3) call graphs as a summary (losing information), rather than examining representative program states (i.e. stack samples). The goal is to find the problem with accuracy, not to measure it with accuracy.
Mike Dunlavey
... Here's maybe a better explanation: http://stackoverflow.com/questions/406760/whats-your-most-controversial-programming-opinion/1562802#1562802
Mike Dunlavey
Thank you SO much for this idea. I just used it and I was able to identify and ameliorate some serious bottlenecks light years faster than any other method I've tried in the past. I sped up execution by 60 times. I shudder at the thought of all the timing debugging code I was considering adding.
Emtucifor
@Emtucifor: That's great!
Mike Dunlavey
@Mike, I wholeheartedly agree with you on sampling profilers. I love OProfile; it gets a bad rap as a kernel profiler when it performs beautifully on any code. It operates on the same principle as your idea of randomly halting the debugger, but it uses hardware interrupts and performance counters. It can also provide call stack information, which is invaluable when the performance bottleneck is malloc, but the real problem is a tight loop that's needlessly allocating.
caspin
@Caspin: I'm not an expert on all profilers, but I'm glad you like OProfile. Many come close but miss the target, such as by sampling only on CPU time, thereby missing all optimization opportunities consisting of needless I/O. Or they persist with the idea of looking at functions, not lines of code, thereby discarding information needlessly and getting caught in the call-graph / recursion / self-time tarpit. And they get caught up in the "measurement accuracy" and "efficiency" tarpits.
Mike Dunlavey
-1: Neat idea, but if you're getting paid to work in even a moderately performance oriented environment this is a waste of everyone's time. Use a real profiler so we don't have to come along behind you and fix the actual problems.
280Z28
@280Z28: Maybe this isn't "real" profiling (only 43x speedup): http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773. Zoom is on the right track. Other "real" profilers are prone to common myths: http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
@280Z28: There is a certain tendency among programmers (myself included sometimes) to denigrate unusual viewpoints.
Mike Dunlavey
@280Z28: There are actual problems that this technique has trouble with, such as very message-driven asynchronous situations where the reason why the program is waiting is very hard to tease out. I haven't found any within-thread issues it can't nail, and often most profilers miss them completely (without careful detective work). But my mind is open - you got any?
Mike Dunlavey
@Mike Dunlavey: I downvoted because in my experience "the technique described in this answer has a significantly worse than average balance of effort to results". Also, this article about concurrency profiling is long but well worth the read: http://msdn.microsoft.com/en-us/magazine/ee336027.aspx
280Z28
@280Z28: Yeah, I've had to resort to other methods when there's inter-process communication involved. Regarding "The technique described in this answer has a significantly worse than average balance of effort to results." I don't know who said that or if they really tried it or just had a prior opinion. Like I said, give me an example. Computer science is still a science, not a balance of opinions.
Mike Dunlavey
@Mike Dunlavey: That was a quote from my thought process in downvoting. My experience comes from parallel computing and now the game industry. Also, I wouldn't have downvoted you if the answer had been voted a heavy second to a detailed list of suggestions for real (scientific?) profiling. It's a thorough answer and absolutely correct in its own right; it just suffers from not helping someone looking for a quality profiler for C++ on Linux (perhaps someone who's used to Microsoft's extremely high quality one for Windows).
280Z28
@280Z28: Fair enough. If somebody's looking for a high-quality profiler for Linux, I think Zoom is there (nearly). I haven't seen a similar one on Windows, but that's not my main line of work. MS products tend to have really good quality, but that doesn't help if they are confused about what is the most effective information to give you. They still do things like looking only at CPU time and all the other stuff I've been trying to point out. Anyway, thanks for the exchange.
Mike Dunlavey
@Mike Dunlavey: The default profiling mode in Visual Studio (since 2005) is a low-overhead sampling profiler. Limited to just that, it makes Zoom appear rudimentary (that or the rotateright website/Zoom demo is severely lacking). Once you get into things like showing cross-thread dependencies and detailed blocking information (specific file IO, database queries, etc), and memory allocation and object lifetime tracking, the Microsoft one is scary good.
280Z28
@280Z28: If you've got to cross some water, and you have a rowboat and a Mercedes, which do you choose? One is rudimentary, the other has high quality but only works on its own terms. Did you look at this? http://stackoverflow.com/questions/926266/performance-optimization-strategies-of-last-resort/927773#927773. The VS profiler can't do it (unless they've made big changes to it).
Mike Dunlavey
@Mike Dunlavey: If you don't mind, I don't want to get biased by reading your discussion of that problem - I'd like to take the original code and use strictly the VS tools to see where it gets me. I'll make note of the actual amount of developer time it takes me to make the improvements as well as the time it takes to run (graphed against each other).
280Z28
@280Z28: Good test, and you're right to try not to listen to my telling you where the problems actually are, so you'll rely on VS to tell you. I hope you don't gag on the old-fashioned code, sorry.
Mike Dunlavey
@Mike: Thank you for the example of Bayesian reasoning. In the last column of your tables, is `P(f >= x)` correct or do you really mean `P(f >= x | o=2/2)`?
unutbu
@~unutbu: Um, yeah, I guess. It's just the second-to-last column normalized to add up to 1.
Mike Dunlavey
+9  A: 

You can use valgrind with the following option:

valgrind --tool=callgrind ./(Your binary)

It will generate a file called callgrind.out.x (where x is the process ID). You can then use the kcachegrind tool to read this file. It will give you a graphical analysis of things, with results like which lines cost how much.
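
For reference, both the graphical and text-mode readers take that same output file (a small sketch; the x suffix is the PID from your run):

kcachegrind callgrind.out.x          # graphical call-graph browser
callgrind_annotate callgrind.out.x   # text-mode per-function cost summary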

Ajay
+2  A: 

You can use Sun Studio's collect/analyzer. Using collect you can also profile memory usage, threads, MPI, etc. You also get a nice timeline view of your program.

If you use these tools on Solaris, you can also get hardware performance counter information, as in VTune or oprofile.
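
If I remember the workflow correctly, it is roughly this (a sketch only; experiment names and defaults may differ by version):

collect ./myprog      # records an experiment directory, e.g. test.1.er
analyzer test.1.er    # opens the GUI; er_print gives a text-mode report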

You can get this (and other very useful tools from Sun) at: http://developers.sun.com/sunstudio/downloads/express/index.jsp

Carlos
+4  A: 

I would use Valgrind and Callgrind as the base for my profiling tool suite. What is important to know is that Valgrind is basically a virtual machine:

(wikipedia) Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor. Instead, Valgrind first translates the program into a temporary, simpler form called Intermediate Representation (IR), which is a processor-neutral, SSA-based form. After the conversion, a tool (see below) is free to do whatever transformations it would like on the IR, before Valgrind translates the IR back into machine code and lets the host processor run it.

Callgrind is a profiler built upon that. The main benefit is that you don't have to run your application for hours to get a reliable result. Even a one-second run is sufficient to get rock-solid, reliable results, because Callgrind is a non-probing profiler.

Another tool built upon Valgrind is Massif. I use it to profile heap memory usage. It works great. What it does is give you snapshots of memory usage -- detailed information about WHAT holds WHAT percentage of memory, and WHO put it there. Such information is available at different points in time of the application's run.
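
A minimal Massif session looks something like this (a sketch; the output-file suffix is the PID):

valgrind --tool=massif ./myprog   # writes massif.out.<pid>
ms_print massif.out.<pid>         # renders the snapshots as a graph plus allocation trees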

+6  A: 

Oprofile is a decent free option, but I've had better luck using Zoom. It's a commercial (free eval) profiler for Linux that has a sweet GUI for seeing hotspots in your source code.

federal
I'm a profiler-skeptic, but I have to admit, Zoom seems to be on the right track. I would have them butterfly lines of code, rather than routines, and get rid of the "Self" column. They should let you pick particular stack traces to look at. And, they take about 1000 times more samples than necessary. But they are on what I think is the right track (after all these years of gprof-ism).
Mike Dunlavey
+6  A: 

Newer kernels (e.g. the latest Ubuntu kernels) come with the new 'perf' tools (apt-get install linux-tools)

These come with classic sampling profilers (see the man page) as well as the awesome timechart!

[screenshot: perf timechart output]
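
A typical session might look like this (a sketch; exact package and subcommand availability varies across distros and kernel versions):

perf record -g ./myprog           # sample the program with call-graph capture
perf report                       # interactive per-symbol breakdown
perf timechart record ./myprog    # trace for the timechart
perf timechart                    # writes output.svg visualizing CPU/process activity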

Will
Great tool! Is there any way for me to get a typical "butterfly" view that starts from "main->func1->func2" style? I can't seem to figure that out... `perf report` seems to give me the function names with the call parents... (so it's sort of an inverted butterfly view)
kizzx2
+1  A: 

There is also LTTng (http://lttng.org/). I've never used that one, though, so I cannot tell how well it works. But one advantage it has is a userspace tracer. In some situations that could be rather nice to have.

Michiel