views: 278 · answers: 6

I am trying to figure out where my C++ program is spending its time, using gprof. Here's my dilemma: if I compile with the same optimization settings I use for my release build, pretty much everything gets inlined, and gprof tells me, unhelpfully, that 90% of my time is spent in a core routine, where everything was inlined. On the other hand, if I compile with inlining disabled, the program runs an order of magnitude slower.

I want to find out how much time procedures called from my core routine are taking, when my program is compiled with inlining enabled.

I am running 64-bit Ubuntu 9.04 on a quad-core Intel machine. I looked into google-perftools, but that doesn't seem to work well on x86_64. Running on a 32-bit machine is not an option.

Does anyone have suggestions as to how I can more effectively profile my application, when inlining is enabled?

Edit: Here is some clarification of my problem. I apologize if it was not clear initially.

I want to find where the time was being spent in my application. Profiling my optimized build resulted in gprof telling me that ~90% of the time is spent in main, where everything was inlined. I already knew that before profiling!

What I want to find out is how much time the inlined functions are taking, preferably, without disabling optimization or inlining in my build options. The application is something like an order of magnitude slower when profiling with inlining disabled. This difference in execution time is a convenience issue, but also, I am not confident that the performance profile of the program built with inlining disabled will strongly correspond to the performance profile of the program built with inlining enabled.

In short: is there a way to get useful profiling information on a C++ program without disabling optimization or inlining?

A: 

It doesn't matter that the code is running slower (your convenience aside, of course) - the profiler will still tell you the correct proportion of time spent in each function.

anon
another 10 up-votes for stating the obvious and adding no value -- whee :P
Hassan Syed
Maybe -- but then again, it's entirely possible that this isn't the case. If you consider function calls as a tree, this answer takes for granted that all subtrees benefit equally from inlining. In reality, this is almost always going to be false. The difference will be insignificant in some cases, but quite significant in others.
Jerry Coffin
that and most profilers are driven by hardware interrupts -- and include other noise.
Hassan Syed
If the inlining enables better optimization of the core function (which is very possible), then profiling without the inlining will be misleading.
Novelocrat
I don't understand why this one is being downvoted. What Neil stated is true. If you have a bottleneck, and turn inlining off, that bottleneck will be obvious in the profiler. You'll still be at least as slow as your slowest function. If your biggest slowdown is the calling overhead, you should probably pat yourself on the back and call it a day. If there really is no obvious place where the code is slow, then you have to judge whether it is worth optimizing for 2-3% improvements.
Steve
Any profiling is of course only a rough indicator of what will happen in real, optimised code. However, the proportion of time spent in each function is likely to be fairly representative of what is happening in the real code, and provides a good first approximation of which functions would benefit from hand optimisation, which is surely the object of the exercise.
anon
@Hassan Syed: Most profilers *do* include some noise, but it's not (even close to) unique to the situation of all the functions being generated inline -- i.e. it's true in general, not just the situation at hand.
Jerry Coffin
I am not confident that a program compiled with inlining disabled will be equivalent, modulo overall runtime, to a program where the compiler has inlined functions like crazy: inlining can enable further optimizations to take place, such as constant propagation, loop hoisting, loop fusion, etc.
Bradford Larsen
Good programmers use a profiler to get answers. _Really_ good programmers use a profiler to get insight.
MSalters
Mike Dunlavey
... Sorry to blather on, but I have a problem with the vocabulary we use to discuss performance. I would prefer we not talk about *functions* and how *slow* they are, but about *lines of code* and what *fraction of time* they *are responsible for* (or *cost*, for short). In my experience, all performance problems have a very precise location, and cost a healthy fraction of time, which is saved when they are "fixed", by removing them, invoking them less, or otherwise finding a better way to accomplish the same purpose.
Mike Dunlavey
... and *bottleneck* - implies some limiting resource that constrains how much of our legitimate business can be done at a time. Real performance problems are like those fractal curves (http://en.wikipedia.org/wiki/Koch_snowflake) where in the middle of each direct line you take a little side-trip, and in the middle of each line of that you take another, and so on. Each side-trip (function call) seems innocent, but in aggregate they multiply the length of the trip by a large factor, unnecessarily so. Each call is like a credit card purchase, tempting you to think it doesn't cost much.
Mike Dunlavey
... and many of them don't - they are a good thing - but some are killers and they don't advertise themselves in the source code. There's no way to guess reliably where they are. Stackshots catch them in the act. (http://stackoverflow.com/questions/406760/whats-your-most-controversial-programming-opinion/1562802#1562802). Repeat: this has **nothing** to do with inlining - the entry/exit cost of function calls could be absolutely free and perfectly optimized. It has **everything** to do with the extra semantic work you don't realize you're asking for.
Mike Dunlavey
+1  A: 

Develop a few macros using your CPU's high-resolution timing mechanism (e.g., RDTSC on x86) -- routines that don't rely on system calls -- and bind a single thread running your core loop to a specific CPU (set its affinity). You would need to implement the following macros.

PROF_INIT           // allocate any needed variables -- probably a const char*
PROF_START("name")  // start a timer
PROF_STOP()         // stop the timer and compute the difference,
                    // which you write out via an async fd

I had something like this that I placed in every function I was interested in. I made sure the macros placed the timing calls into the context of the call tree -- this is possibly the most accurate way to profile.

Note:

This method is driven by your code -- it does not rely on an external tool to snoop on your program in any way. Snooping, sampling, and interrupt-driven profiling are inaccurate when it comes to small sections of code. Besides, you want to control where and when the timing data is collected -- at specific constructs in your code, such as loops, the beginning of a recursive call-chain, or mass memory allocations.

-- edit --

You might be interested in the link from this answer to one of my questions.

Hassan Syed
+1. Should be worth a try at least.
Marcus Lindblom
+7  A: 

I assume what you want to do is find out which lines of code are costing you enough to be worth optimizing. That is very different from timing functions. You can do better than gprof.

Here's a fairly complete explanation of how to do it.

You can do it by hand, or use one of the profilers that can provide the same information, such as oprofile and RotateRight/Zoom.

BTW, inlining is of significant value only if the routines being inlined are small and don't call functions themselves, and if the lines where they are being called are active enough of the time to be significant.

As for the order of magnitude performance ratio between debug and release build, it may be due to a number of things, which may or may not include the inlining. You can use the stackshot method mentioned above to find out for certain just what's going on in either case. I've found that debug builds can be slow for other reasons, like recursive data structure validation, for example.

Mike Dunlavey
(+1) nice info, especially the stackshot link (which says the key is capturing and not sampling), although all the tools seem to be measuring rather than capturing -- stackshot too.
Hassan Syed
@Hassan: Yeah. I think of it as a matter of emphasis, on the precision of location, rather than on the precision of measurement. Oprofile and Zoom take stack samples on wall-clock time, and they can give you, per statement, the % of samples containing that statement. If they give it from a lot of samples, that's OK but not really necessary. Personally I do it by hand, because then I can also see more state information such as the data being worked on. Often that is crucial information.
Mike Dunlavey
Yes I am in agreement, I do this stuff by hand as well -- and consensus in profiling game engines / embedded work / OS routines and RTOS stuff pretty much uses this approach as well.
Hassan Syed
@Hassan: No kidding? Boy, they've been awful quiet on SO. I'm sure in high performance graphics every cycle counts. I've been using this method since doing assembler on a mini in '78. Even then, the hardware guys wanted to provide me with timers for performance tuning. I said, thanks but no thanks. A "pause" button, and "single step" are not only the simplest tools, but the most effective.
Mike Dunlavey
The profile build I was attempting used options identical to my release build, except for the addition of -fno-inline and -pg. I did not explicitly inline any of my functions; I have left that to the compiler, under the assumption that it can do a better job than I can at deciding what is appropriate to inline. Because of all the inlining the compiler was doing, I was unable to find out even which functions were taking up time---they all got inlined. Profiling with inlining explicitly disabled let me get this info. Finding out which lines are expensive would also be useful.
Bradford Larsen
@Bradford: I would just take stackshots either way. They can tell exactly what's going on. Maybe the issue is inlining, but frankly that would really surprise me. More likely it is something I would never have been able to guess. Don't think too hard about it - just do it - and see what it tells you. It's easy. I would start with the slow version. Something is costing 90% of the time, so you will see exactly what it is as soon as you pause it, with 90% certainty. Pausing it a few more times will leave no doubt.
Mike Dunlavey
... I'll be interested to hear what you find out.
Mike Dunlavey
@Mike: My problem was not stated as clearly as it could have been initially. My issue is that attempting to profile with the same optimization settings as my release build results in gprof telling me nothing useful: almost all the code of my inner loop gets inlined into the body, and gprof just tells me "gee, you spend all your time in your main loop." See my edited question for more information.
Bradford Larsen
@Mike: When I have more time (paper deadline in 2 days), I will try the "stackshot" method you suggest. In the meantime, I compiled with profiling support and with inlining explicitly disabled, but other build options the same as in my release build. When running this build, my application ran many times slower, but gprof was able to give me more useful information than "90% of the time is spent in the main loop, which is the whole program". It turns out a big culprit is code that empties a hash table, taking about 25% of the total run time. The rest of the profile was fairly flat.
Bradford Larsen
@Bradford: I'm glad **gprof** gave you some insight. Stackshots should be able to give more. I just got a comment on this answer (http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024) where the user claims a 60x speedup.
Mike Dunlavey
+1  A: 

You can use a more powerful profiler, such as Intel's VTune, which can give you assembly line level of performance detail.

http://software.intel.com/en-us/intel-vtune/

It's for Windows and Linux, but does cost money...

Inverse
Yeah, that would do the trick -- in theory, that is. Those tools are horribly unreliable on Windows -- they gave me BSODs on Win2003 and XP on many machines and generally caused other oddities as well.
Hassan Syed
+1  A: 

Would valgrind be any more helpful?

Dmitry
A: 

You can use gcov to give you line-by-line execution counts. This should at least tell you which inlined functions are the bottleneck.

ergosys
I played around with gcov a little bit, at your suggestion. It gives call counts, which is not enough to tell me what's taking lots of time.
Bradford Larsen