views: 143
answers: 4

The profilers I have experience with (mainly the Digital Mars D profiler that comes with the compiler) seem to massively slow down the execution of the program being profiled. This has a major effect on my willingness to use a profiler, since it makes profiling a "real" run of many of my programs, as opposed to testing on a very small input, impractical. I don't know much about how profilers are implemented. Is a major (>2x) slowdown when profiling pretty much a fact of life, or are there profilers that avoid it? If it can be avoided, are there any fast profilers available for D, preferably for D2 and preferably for free?

+13  A: 

I don't know about D profilers, but in general there are two different ways a profiler can collect profiling information.

The first is instrumentation: the profiler injects logging calls all over the place. This slows the application down more or less. Typically more.
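
As a rough illustration (a Python sketch rather than D; the `work`/`helper` functions are made up for the example), instrumentation amounts to running a hook on every single call, which is exactly where the overhead comes from:

```python
import sys
from collections import Counter

call_counts = Counter()

def profile_hook(frame, event, arg):
    # Runs on every Python function call; this per-call cost is the
    # overhead an instrumenting profiler imposes on the whole run.
    if event == "call":
        call_counts[frame.f_code.co_name] += 1

def helper(i):
    return i * i

def work(n):
    total = 0
    for i in range(n):
        total += helper(i)
    return total

sys.setprofile(profile_hook)
work(100)
sys.setprofile(None)
print(call_counts["helper"])  # 100: exact counts, paid for on every call
```

The upside of paying that cost is exact call counts, which sampling cannot give you.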

The second is sampling: the profiler interrupts the application at regular intervals and inspects the call stack. This hardly slows the application down at all.
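
A sampling profiler can be sketched in a few lines as well (again Python, and Unix-only since it relies on signal timers; the sampling rate and the `busy_loop` workload are arbitrary choices for the demo):

```python
import signal
import time
from collections import Counter

samples = Counter()

def take_sample(signum, frame):
    # The timer interrupt hands us the interrupted call stack; walk it
    # and record every function on it. Between interrupts the program
    # runs completely untouched, which is why the overhead is low.
    while frame is not None:
        samples[frame.f_code.co_name] += 1
        frame = frame.f_back

signal.signal(signal.SIGALRM, take_sample)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # ~100 samples/second

def busy_loop():
    deadline = time.time() + 0.5
    while time.time() < deadline:
        pass

busy_loop()
signal.setitimer(signal.ITIMER_REAL, 0, 0)  # stop sampling
print(samples["busy_loop"] > 0)  # busy_loop dominates the samples
```

Note that `ITIMER_REAL` samples on wall-clock time, so time spent blocked (on I/O, for example) is counted too.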

The downside of a sampling profiler is that the result is not as detailed as with an instrumenting profiler.

Check your profiler's documentation to see whether it can run with sampling instead of instrumentation. If not, you now have two new search terms: "sampling" and "instrumenting".

Albin Sunnanbo
++ Yeah, sampling rules! But not just any old sampling. Sampling the call stack, on wall-clock time, and reporting by line (not function) percent of samples containing that line. And the extra detail you get with instrumentation is only noise anyway, if your goal is to find code in need of optimizing.
Mike Dunlavey
For this reason I tend to strongly prefer sampling profilers wherever practical: they provide lots of reliable data for a relatively small cost in performance, and I can run them for a long time to average out results from a stochastic, noisy process (like a function that takes 1 ms 90% of the time it's called, and 500 ms the other 10%).
Crashworks
I find the exact function call counts that instrumenting profilers give useful from time to time. If I load 120 customers into a list, I look at the functions that are called 120 (and, worse, n × 120) times. I can then find functions called by mistake (e.g. triggered by an event), or calls made once per item where a single call for all of them would be sufficient.
Albin Sunnanbo
+1  A: 

I'd say yes: both the sampling and instrumenting forms of profiling will tax your program heavily, regardless of whose profiler you are using and what language you are in.

Brent Arias
@Mystagogue: many sampling profilers can scale how frequently they take their samples. This allows the overhead to be considerably less than instrumentation profiling (on the order of 1-5%). The quality of the data depends on the sampling rate, of course. Both forms of profiling do have non-zero impact, but sampling is significantly less.
Chris Schmich
+1  A: 

You could try h3r3tic's xfProf, which is a sampling profiler. Haven't tried it myself, but that guy always makes cool stuff :)

From the description:

If the program is sampled only a few hundred (or thousand) times per second, the performance overhead will not be noticeable.

torhu
Thanks. This should be an improvement (haven't tested it yet), though it still says you need to compile without optimizations, which probably costs more than 2x in performance by itself.
dsimcha
@dsimcha: @torhu: I just checked the xfProf site. I'm sad. It's built on the same principles as gprof. Here's why that makes me sad: http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
+1  A: 

My favorite method of profiling slows the program way way down, and that's OK. I run the program under the debugger, with a realistic load, and then I manually interrupt it. Then I copy the call stack somewhere, like to Notepad. So it takes on the order of a minute to collect one sample. Then I can either resume execution, or it's even OK to start it over from the beginning to get another sample.

I do this 10 or 20 times, long enough to see what the program is actually doing from a wall-clock perspective. When I see something that shows up a lot, then I take more samples until it shows up again. Then I stop and really study what it is in the process of doing and why, which may take 10 minutes or more. That's how I find out if that activity is something I can replace with more efficient code, i.e. it wasn't totally necessary.

You see, I'm not interested in measuring how fast or slow it's going. I can do that separately, with no more than a stopwatch. I'm interested in finding out which activities take a large percentage of time (not amount, percentage), because if something takes a large percentage of time, that percentage is the probability that any given stackshot will catch it in the act.

By "activity" I don't necessarily mean where the PC hangs out. In realistic software the PC is almost always off in a system or library routine somewhere. Typically more important are the call sites in our own code. If I see, for example, a string of 3 calls showing up on half of the stack samples, that represents very good hunting, because if any one of those isn't truly necessary and can be done away with, execution time will drop by half.
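
The arithmetic behind that last claim can be made explicit (a trivial sketch; the 50% figure is just the one from the example above): if removable work appears on fraction p of the stack samples, it accounts for roughly fraction p of wall-clock time, so removing it yields a speedup of 1/(1-p):

```python
def speedup_if_removed(p):
    # p: fraction of stack samples containing the removable call site.
    # Removing that work leaves (1 - p) of the original running time.
    return 1.0 / (1.0 - p)

print(speedup_if_removed(0.5))  # 2.0: on half the samples -> twice as fast
print(speedup_if_removed(0.2))  # 1.25
```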

If you want a grinning manager, just do that once or twice.

Even in math-heavy scientific number-crunching apps, where you would think low-level optimization and hotspots rule the day, you know what I often find? The math library routines are checking arguments, not crunching. Often the code is not doing what you think it's doing, and you don't have to run it at top speed to find that out.

Mike Dunlavey
Less useful when you have a rather flat profile where no one component takes more than 2% of execution time by itself, but you have to somehow get your main loop down from 40ms to 33ms by fixing fifty small things. But that's a pretty specialized case.
Crashworks
Also you really should take a look at modern sampling profilers. I keep recommending them to you not because I think your method is poor, but because I totally agree with your approach and am happy that there are automatic tools that can do it 1000 samples per second.
Crashworks
@Crashworks: Thanks. My experience is that you only get to the point of having tiny little things to optimize after a process of getting rid of some pretty massive things you never guessed were in there. As far as tools, I think Zoom and LTProf have the right idea, but as I've tried to explain before, you don't need a large number of samples unless your biggest problems are really really small. Also, I don't know any tool that lets you examine a representative stack sample in detail, and the program context in effect at the time it was taken. That gives you insight that numbers can't.
Mike Dunlavey
@Crashworks: In your example, if the loop runs 1000 times, you want the time to go from 40s to 33s. That means removing 18% of what it's doing. If a frame takes 40ms, that's enough time for a fairly bushy call tree, so I would want to make sure there weren't any branches I could lop off. When I only get down to trimming leaves, I'm close to the point of diminishing returns. Fortunately, there's the magnification effect. Any improvement you make makes remaining opportunities more obvious.
Mike Dunlavey
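
The "magnification effect" mentioned in that last comment is just a ratio shifting. The numbers below reuse the 40 ms frame from the comment thread; the individual call-site costs are made up for illustration:

```python
frame_ms = 40.0
costs = {"a": 7.0, "b": 4.0}   # two hypothetical removable call sites

share_b_before = costs["b"] / frame_ms    # "b" is 10% of the frame
frame_after = frame_ms - costs["a"]       # remove "a": a 33 ms frame
share_b_after = costs["b"] / frame_after  # "b" is now ~12% of the frame

# "b" itself did not change, but it now occupies a larger fraction of a
# smaller total, so it shows up in proportionally more stack samples.
print(round(share_b_before, 3), round(share_b_after, 3))
```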