views: 1367
answers: 6

I'm trying to find open source profilers rather than one of the commercial profilers that I'd have to pay $$$ for. When I searched on SourceForge, I came across these four C++ profilers that looked quite promising:

  1. Shiny: C++ Profiler
  2. Low Fat Profiler
  3. Luke Stackwalker
  4. FreeProfiler

I'm not sure which of these profilers would be best for learning about the performance of my program. It would be great to hear some suggestions.

+5  A: 

You could try Windows Performance Toolkit. Completely free to use. This blog entry has an example of how to do sample-based profiling.

Michael
I just looked into it, but it turns out it requires Windows Vista or Server 2008 to install. Since the laptop I'm using for development runs XP and I don't want to install Vista on it, I don't think I can personally go with this option. Thanks for the suggestion anyway.
stanigator
I liked Very Sleepy or even Luke Stackwalker better. xPerf sounds nice in theory, but in practice it is very awkward to use: hard to get running, and slow at processing the gathered data.
Suma
+1  A: 

We use LtProf and have been happy with it. Not open source, but only $$, not $$$ :-)

puetzk
+5  A: 
Larry Gritz
+3  A: 

There's more than one way to do it.

Don't forget the no-profiler method.

Most profilers assume you need 1) high statistical precision of timing (lots of samples), and 2) low precision of problem identification (functions & call-graphs).

Those priorities can be reversed: the problem can be located to a precise machine address, while cost precision is a function of the number of samples.

Most real problems cost at least 10% of run time, and at that scale high timing precision is not essential.

Example: If something is making your program take 2 times as long as it should, that means there is some code in it that costs 50%. If you take 10 samples of the call stack while it is being slow, the precise line(s) of code will be present on roughly 5 of them. The larger the program is, the more likely the problem is a function call somewhere mid-stack.

It's counter-intuitive, I know.

NOTE: xPerf is nearly there, but not quite (as far as I can tell). It takes samples of the call stack and saves them - that's good. Here's what I think it needs:

  • It should only take samples when you want them. As it is, you have to filter out the irrelevant ones.

  • In the stack view it should show specific lines or addresses at which calls take place, not just whole functions. (Maybe it can do this, I couldn't tell from the blog.)

  • If you click to get the butterfly view, centered on a single call instruction, or leaf instruction, it should show you not the CPU fraction, but the fraction of stack samples containing that instruction. That would be a direct measure of the cost of that instruction, as a fraction of time. (Maybe it can do this, I couldn't tell.) So, for example, even if an instruction were a call to file-open or something else that idles the thread, it still costs wall clock time, and you need to know that.

NOTE: I just looked over Luke Stackwalker, and the same remarks apply. I think it is on the right track but needs UI work.

ADDED: Having looked over Luke Stackwalker more carefully, I'm afraid it falls victim to the assumption that measuring functions is more important than locating statements. So on each sample of the call stack, it updates the function-level timing info, but all it does with the line-number info is keep track of the min and max line number seen in each function; the more samples it takes, the farther apart those get. So it basically throws away the most important information: the line numbers. The reason that matters is that if you decide to optimize a function, you need to know which lines in it need work, and those lines were on the stack samples (before they were discarded).

One might object that if the line-number information were retained, it would run out of storage quickly. Two answers. 1) Only so many distinct lines show up on the samples, and they show up repeatedly. 2) Not so many samples are needed; the assumption that high statistical precision of measurement is necessary has always been made, but never justified.

I suspect other stack samplers, like xPerf, have similar issues.

Mike Dunlavey
+1  A: 

It's not open source, but AMD CodeAnalyst is free. It also works on Intel CPUs despite the name. There are versions available for both Windows (with Visual Studio integration) and Linux.

Soo Wei Tan
+1  A: 

Of those you have listed, I found Luke Stackwalker to work best: I liked its GUI, and it was easy to get running.

Another similar one is Very Sleepy: similar functionality, sampling that seems more reliable, and a GUI that is perhaps a little harder to use (not as graphical).


After spending some more time with them, I have found one quite important drawback. While both try to sample at 1 ms resolution, in practice they do not achieve it because their sampling method (StackWalk64 on the attached process) is way too slow. For my application it takes something like 5-20 ms to get a callstack. Not only does this make your results imprecise, it also skews them, as short callstacks are walked faster and therefore tend to get more hits.

Suma
Hi again, Suma. IMO approximate percent of total time used by each line of code (or function, if you like) is more important to know than precise absolute time, and percent of time should not be affected by sampling overhead or frequency of sampling.
Mike Dunlavey
The major problem is "short callstacks are walked faster". This makes the sampling frequency vary with call stack depth when attempting to sample at 1 ms, which is definitely a bad thing. Sampling at 20 ms would be possible, but that is too coarse for me. I have implemented a different sampling method in a derived version (not published anywhere), where the application samples its own callstacks and sends the results to the Sleepy profiler through a named pipe. This way the sampling is about 1000x faster (1-5 us), which makes 1 ms sampling work very well.
Suma
It's natural to assume you have to take as many samples as possible as fast as possible (for accuracy of time estimation), but if some activity is taking 10% of the time, it will appear on roughly 10% of samples, whether you take 20 samples or 20,000, *provided the samples happen at unpredictable times* with respect to what the program is doing. So you won't have any trouble finding it, it will jump out at you. http://stackoverflow.com/questions/406760/whats-your-most-controversial-programming-opinion/1562802#1562802
Mike Dunlavey
We have already met in a discussion like this before, so I will be repeating myself. Your assumptions about the type of workload I am profiling are wrong. What I am most interested in is the performance of the slowest frames. Such frames do not represent 10%, but a lot less. Individual functions contributing to the problem often represent only 1% or less of the total application running time. The slow frame duration is usually 50-100 ms. Having them sampled at 20 ms is absolutely unacceptable.
Suma
Your technique is very good when the application workload is more or less constant, and even more so if the bad performance is caused by only a few individual functions. This may be quite common in your domain, but it is not common in mine.
Suma
"Normally the slowness itself causes that". As explained, it does NOT cause it in my case, therefore for my workload the method is useless.
Suma
Sorry to be a pest. The method is not limited to a particular type of workload. It is only necessary that the samples occur preferentially during the slow code, and the slowness itself causes that. If you're only interested in a particular chunk of code or circumstance, then there is the technique of temporarily amplifying it by adding an outer loop so it is easier to sample. I know this technique may be hard to understand without actually doing it.
Mike Dunlavey
Right. The slowness causes it to use a higher percentage of time, which exposes it to a higher percentage of samples. Thus, you can see why it's slow.
Mike Dunlavey
No. The high percentage of time within a frame can be seen in only an absolutely minor percentage of frames. Therefore the total percentage is still low; low-frequency sampling is likely to miss it, and high-frequency sampling is likely to find it. "I know this technique may be hard to understand without actually doing it." I am aware of this method, I use it sometimes, and there are situations where I find it useful, but it is really no panacea, and it tires me to see you propagating it as such. It is absolutely useless against some types of "slowness" that are common in my job.
Suma
Unable to edit old comment: "As explained, it does cause it in my case" should read "As explained, it does NOT cause it in my case". Sorry for such a meaning-reversing typo.
Suma
Don't want to tire you. So you *occasionally* have a frame that takes too long. FWIW, I would amplify each frame by say 100 times with an outer loop (basically speeding up my cognition by 100 times), and when a frame took too long, stab it. Clumsy, I know, but that's what I would do. No need to respond.
Mike Dunlavey
This cannot work, as frames are impossible to "repeat". The world state at the end of a frame is completely different from what it was at the frame's beginning. You would need to save this state and roll back in time, which would be so time consuming it would hide all other issues. Is it really absolutely impossible for you to see that this technique sometimes fails to be useful, and that a sampling profiler sometimes really does a much better job?
Suma
OK, then if you can't amplify a frame's execution without changing the semantics, that makes the problem harder. Then, I would try to use an alarm-clock type interrupt. If a fast frame takes 10ms, and the slow one takes 50, at the start of a frame set a (20+-random)ms alarm, and at the end of the frame, clear the alarm. That way the alarm goes off only in a slow frame, and you grab the sample. The timer setup and grabbing doesn't have to be fast, because it doesn't affect what the frame code is doing. I'm thick, but what's wrong with that?
Mike Dunlavey
I'm only beating this drum because performance is important and people have some silly ideas about it. This just says what you need to know is, *when the code is taking too long*, ask *what it is doing, semantically* on a *percentage-of-time* basis. To ask this, tricks may be necessary, like amplifying or artificially slowing down execution. These don't change the percentages much. It's a different way of thinking, and plenty of comments on SO report substantial results. I think you understand this.
Mike Dunlavey
Yes there are problems that random state / stack sampling during slowness will not diagnose. Those are the ones where the waiting-chain spans across asynchronous processes, so it is harder to determine who is waiting for whom, to see if what's being waited for isn't really needed.
Mike Dunlavey