views:

50

answers:

2

I tried profiling my application with OProfile recently. The data gathered is already very valuable to me, but I'm having difficulties with its precise interpretation. After running my app with oprofile set up and running, I generated the report and got:

root@se7xeon:src# opreport image:test -l -t 1
Overflow stats not available
CPU: P4 / Xeon with 2 hyper-threads, speed 3191.66 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 750000
samples % symbol name
215522 84.9954 cci::Image::interpolate(unsigned char*, cci::Matrix const&) const
17998 7.0979 cci::Calc::diff(unsigned char const*, unsigned char const*)
13171 5.1942 cci::Image::getIRect(unsigned char*, int, int) const
5519 2.1765 cci::Image::getFRect(unsigned char*, double, double) const

Okay, so my interpolation function is responsible for 84% of the application's (too long) execution time. Seems a good idea to look into it then:

root@se7xeon:src# opannotate image:test --source
[...]

/* cci::Image::interpolate(unsigned char*, cci::Matrix<cci::Point2DF> const&) const total: 215522   84.9954 */  
1392  0.5529 :void Image::interpolate(CCIPixel *output, const Matrix<Point2DF> &inputPoints) const throw()  
4  0.0016 :{  
[...]  
:                col0 = static_cast<int>(point[idx].x);  
3  0.0012 :      col1 = col0+1;  
629  0.2498 :    row0 = static_cast<int>(point[idx].y);  
385  0.1529 :    row1 = row0+1;  
56214 22.3266 :  if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:                {  
:                        col0 = row0 = col1 = row1 = 0;  
:                }  

If I understand correctly, the if conditional is responsible for over 22% of the program's execution time. The opening brace and the function declaration seem to take time, is that supposed to correspond to the function call overhead ("push parameters on stack, jump, pop parameters" sequence)?

I changed some things in the source (related to a later bottleneck because I had no idea how to optimize an if), recompiled, ran through oprofile again (not forgetting opcontrol --reset). Now the annotated code looks like this in the same place:

6  0.0024 :     curPx = point[idx].x;  
628  0.2477 :   curPy = point[idx].y;  
410  0.1617 :   col0 = static_cast<int>(curPx);  
57910 22.8380 : col1 = col0+1;  
:               row0 = static_cast<int>(curPy);  
:               row1 = row0+1;  
:               if (col0 < 0 || col1 >= m_width || row0 < 0 || row1 >= m_height)  
:               {  
:                   col0 = row0 = col1 = row1 = 0;  
:               }  

This time the if takes basically no time at all (?), the most expensive instruction is "col1 = col0 + 1", and the whole time-taking block seems to have shifted upwards. How can this be? Can this be trusted at all to pinpoint bottlenecks in the source?

An another point of doubt for me is that when I set up opcontrol, I entered the traced event as GLOBAL_POWER_EVENTS, with the number of samples being 750k. In the output, the interpolation function seems to take 84%, but the number of samples recorded inside it is only a little bit above 200k. That isn't even 50% of the requested number. Am I to understand that the remaining ~500k samples was taken by applications not listed in the output (kernel, Xorg, etc.)?

+1  A: 

When profiling optimized code you really cannot rely on accurate source code lines. The compiler moves stuff around far too much.

For an accurate picture you will need to look at the code disassembler output.

Zan Lynx
Will it make sense if I profile the unoptmized version instead? Or will the bottlenecks discovered therein be totally unrelated to the optimized version? Also, any way to get percentage-annotated disassembler output? ;)
neuviemeporte
Wow, apparently there's an option to opannotate called --assembly, thanks!
neuviemeporte
A: 

OProfile can (they tell me) get stack samples on wall-clock time (not CPU), and it can give you line-level percentages. What you are looking for is lines that are contained on a large percent of stack samples.

I wouldn't turn on compiler optimization until after I finished hand-tuning the code, because it just hides things.

When you say the interpolate routine uses 84% of the time, that triggers a question. The entire program takes some total time, right? It takes 100% of that time. If you cut the program's time in half, or if you double it, it will still take 100% of the time. Whether 84% for interpolation is too much or not depends on whether it is being done more than necessary.

So I would suggest that you not ask if the percent of a routine is too much. Rather you look for lines of code that take a significant amount of time and ask if they could be optimized. See the difference? After you optimize the code, it can make a large reduction in overall run time, but it might still be a large percent, of a smaller total. Code isn't optimal when nothing takes a large percent. Code is optimal when of all the things that take a large percent, none can be improved.

I don't care for things that just give numbers. What I want is insight. For example, if that routine accounts for 84% of the time, then if you took 10 samples of the stack, it would be on 8.4 of them. The exact number doesn't matter. What matters is to understand why it was in there. Was it really really necessary to be in there so much? That's what looking at the stack samples can tell you. Maybe you're actually doing the interpolation twice as often as necessary? Often people find out, by analyzing the why, that the routine they're trying to speed up didn't need to be called nearly as much, maybe not at all. I can't guess in your case. Only the insight from examining the program's state can tell you that.

Mike Dunlavey