views:

296

answers:

8

On SO, there are quite a few questions about performance profiling, but I don't seem to find the whole picture. There are quite a few issues involved, and most Q&A ignore all but a few at a time, or don't justify their proposals.

What I'm wondering about: if I have two functions that do the same thing and I'm curious about the difference in speed, does it make sense to test this without external tools, with timers, or will this compiled-in testing affect the results too much?

I ask this because, if it is sensible, as a C++ programmer I want to know how it should best be done, since timers are much simpler than external tools. If it makes sense, let's proceed with all the possible pitfalls:

Consider this example. The following code shows two ways of doing the same thing:

#include <algorithm>
#include <cstddef>
#include <ctime>
#include <iostream>

typedef unsigned char byte;

inline
void
swapBytes( byte* in, size_t n )   // byte*, not void*: void* cannot be indexed
{
   for( size_t lo=0, hi=n-1; hi>lo; ++lo, --hi )
   {
      in[lo] ^= in[hi];
      in[hi] ^= in[lo];
      in[lo] ^= in[hi];
   }
}

int
main()
{
         byte    arr[8]     = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
   const int     iterations = 100000000;
         clock_t begin      = clock();

   for( int i=iterations; i!=0; --i ) 

      swapBytes( arr, 8 );

   clock_t middle = clock();

   for( int i=iterations; i!=0; --i ) 

      std::reverse( arr, arr+8 );

   clock_t end = clock();

   double secSwap = (double) ( middle-begin ) / CLOCKS_PER_SEC;
   double secReve = (double) ( end-middle   ) / CLOCKS_PER_SEC;


   std::cout << "swapBytes,    for:    "   << iterations << " times takes: " << middle-begin
             << " clock ticks, which is: " << secSwap    << "sec."           << std::endl;

   std::cout << "std::reverse, for:    "   << iterations << " times takes: " << end-middle
             << " clock ticks, which is: " << secReve    << "sec."           << std::endl;

   std::cin.get();
   return 0;
}

// Output:

// Release:
//  swapBytes,    for: 100000000 times takes: 3000 clock ticks, which is: 3sec.
//  std::reverse, for: 100000000 times takes: 1437 clock ticks, which is: 1.437sec.

// Debug:
//  swapBytes,    for: 10000000 times takes: 1781  clock ticks, which is: 1.781sec.
//  std::reverse, for: 10000000 times takes: 12781 clock ticks, which is: 12.781sec.

The issues:

  1. Which timers to use, and how to get the CPU time actually consumed by the code in question?
  2. What are the effects of compiler optimization (since these functions just swap bytes back and forth, the most efficient thing is obviously to do nothing at all)?
  3. Considering the results presented here, do you think they are accurate (I can assure you that multiple runs give very similar results)? If yes, can you explain how std::reverse gets to be so fast, considering the simplicity of the custom function? I don't have the source code of the VC++ version that I used for this test, but here is the implementation from GNU. It boils down to the function iter_swap, which is completely incomprehensible to me. Would that also be expected to run twice as fast as the custom function, and if so, why?

Contemplations:

  1. It seems two high-precision timers are being proposed: clock() and QueryPerformanceCounter (on Windows). Obviously we would like to measure the CPU time of our code and not the real time, but as far as I understand these functions don't give that functionality, so other processes on the system would interfere with measurements. This page on the GNU C library seems to contradict that, but when I put a breakpoint in VC++, the debugged process gets a lot of clock ticks even though it was suspended (I have not tested under GNU). Am I missing alternative counters for this, or do we need special libraries or classes for this? If not, is clock() good enough in this example, or would there be a reason to use QueryPerformanceCounter?

  2. What can we know for certain without debugging, disassembling and profiling tools? Is anything actually happening? Is the function call being inlined or not? When checking in the debugger the bytes do actually get swapped, but I'd rather know from theory why than from testing.

Thanks for any directions.

update

Thanks to a hint from tojas, the swapBytes function now runs as fast as std::reverse. I had failed to realize that the temporary copy, in the case of a byte, need only be a register, and is thus very fast. Elegance can blind you.

inline
void
swapBytes( byte* in, size_t n )
{
   byte t;

   for( size_t i=0; i<n-1-i; ++i )   // use n, not a hard-coded 7
   {
      t           = in[i];
      in[i]       = in[n-1-i];
      in[n-1-i]   = t;
   }
}

Thanks to a tip from ChrisW I have found that on Windows you can get the actual CPU time consumed by a (read: your) process through Windows Management Instrumentation. This definitely looks more interesting than the high-precision counter.

+1  A: 

Is there something you have against profilers? They help a ton. Since you are on WinXP, you should really give a trial of VTune a try. Try a call-graph sampling test and look at the self time and total time of the functions being called. There's no better way to tune your program so that it's the fastest possible without being an assembly genius (and a truly exceptional one).

Some people just seem to be allergic to profilers. I used to be one of those and thought I knew best about where my hotspots were. I was often correct about obvious algorithmic inefficiencies, but practically always incorrect about more micro-optimization cases. Just rewriting a function without changing any of the logic (ex: reordering things, putting exceptional case code in a separate, non-inlined function, etc) can make functions a dozen times faster and even the best disassembly experts usually can't predict that without the profiler.

As for relying on simplistic timing tests alone, they are extremely problematic. Your current test is not so bad, but it's a very common mistake to write timing tests in ways in which the optimizer will optimize out dead code, so that you end up measuring the time it takes to do essentially a nop or even nothing at all. You should have enough knowledge to interpret the disassembly and make sure the compiler isn't doing this.

Also, timing tests like this have a tendency to bias the results significantly, since a lot of them just involve running your code over and over in the same loop, which tends to test only the case where all the memory is in the cache and branch prediction is working perfectly for it. It's often just showing you best-case scenarios without showing you the average, real-world case.

Depending on real world timing tests is a little bit better; something closer to what your application will be doing at a high level. It won't give you specifics about what is taking what amount of time, but that's precisely what the profiler is meant to do.

I have used profilers before to optimize the performance of entire programs, but given curiosity about a few simple functions, calling a few timers is definitely less hassle than choosing, downloading, installing, reading the manual for and getting to work with a profiler. Altogether, there is a difference between understanding underlying stuff like this and getting your software to reasonable performance. For the latter I would happily use a profiler, and the speed of std::reverse would most likely not worry me at all, unless I was reversing gigabytes...
ufotds
If you are just going for acceptable performance and not exceptional performance, then a timing test might do. However, it's important to keep in mind that while a profiler might take a little time to learn, it's just something you have to do once. In VTune, just use the call-graph sampling wizard, select your exe file, and run it. The only tricky part is that you need to modify your project's settings (http://software.intel.com/en-us/articles/performance-tools-for-software-developers-using-the-intel-compilers-with-vtune-analyzer-or-intel-thread-profiler/). After that, just run and look at the graph.
... self time will then tell you how much time the cpu spends in a given function/class method excluding calls to other functions/methods, and total time will give you the total amount of time spent in a function/method including the time spent to call other functions/methods. It's just like a timing test except you get the time spent for every function called in your test including the total time spent in main.
Can you recommend a good, *free* profiler for Windows? Something like gprof under unix?
zvrba
CodeAnalyst is free if you have an AMD machine. If you have intel, you can still use it but only for TBS (time-based sampling).
@zvrba: Please take a look at this: http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
+1  A: 

I would suppose that anyone competent enough to answer all your questions is going to be far too busy to answer all your questions. In practice it is probably more effective to ask a single, well-defined question. That way you may hope to get well-defined answers which you can collect and be on your way to wisdom.

So, anyway, perhaps I can answer your question about which clock to use on Windows.

clock() is not considered a high-precision clock. If you look at the value of CLOCKS_PER_SEC you will see it has a resolution of 1 millisecond. This is only adequate if you are timing very long routines, or a loop with 10000's of iterations. As you point out, if you try to repeat a simple method 10000's of times in order to get a time that can be measured with clock(), the compiler is liable to step in and optimize the whole thing away.

So, really, the only clock to use is QueryPerformanceCounter().

ravenspoint
+2  A: 

Obviously we would like to measure the cpu time of our code and not the real time, but as far as I understand, these functions don't give that functionality, so other processes on the system would interfere with measurements.

I do two things, to ensure that wall-clock time and CPU time are approximately the same thing:

  • Test for a significant length of time, i.e. several seconds (e.g. by testing a loop of however many thousands of iterations)

  • Test when the machine is more or less idle, except for whatever I'm testing.

Alternatively if you want to measure only/more exactly the CPU time per thread, that's available as a performance counter (see e.g. perfmon.exe).

What can we know for certain without debugging, disassembling and profiling tools?

Nearly nothing (except that I/O tends to be relatively slow).

ChrisW
perfmon, yes thanks for reminding me. I knew it existed, and it is pretty convenient, but do you know if there are system calls we can use to get this information inside our program?
ufotds
@ufotds - When I did it, long ago, I used hairy calls to read the hidden "performance" section of the registry (the calls were easy, but parsing the binary data which they returned was not). Nowadays it might, I don't know, be abstracted by the "WMI" API.
ChrisW
+2  A: 

To answer your main question: the "reverse" algorithm just swaps elements of the array; it does not operate on the elements themselves the way your XOR version does.

tojas
+2  A: 

Is it safe to say you're asking two questions?

  • Which one is faster, and by how much?

  • And why is it faster?

For the first, you don't need high precision timers. All you need to do is run them "long enough" and measure with low precision timers. (I'm old-fashioned, my wristwatch has a stop-watch function, and it is entirely good enough.)

For the second, surely you can run the code under a debugger and single-step it at the instruction level. Since the basic operations are so simple, you will be able to easily see roughly how many instructions are required for the basic cycle.

Think simple. Performance is not a hard subject. Usually, people are trying to find problems, for which this is a simple approach.

Mike Dunlavey
yes, more than two even... but for some reason the visual debugger wouldn't let me step into std::reverse, though I had only tried in release mode. Now, in debug, it works and I can actually see that it does almost exactly what I wrote in the update of swapBytes, apart from verifying pointers etc...
ufotds
+2  A: 

Use QueryPerformanceCounter on Windows if you need high-resolution timing. The counter's accuracy depends on the CPU, but it can go up to per clock pulse. However, profiling real-world operations is always a better idea.

DeadMG
It also depends on when it is called. Many CPUs change clock frequency dynamically.
Pavel Radzivilovsky
A: 

Wha? How to measure speed without a profiler? The very act of measuring speed is profiling! The question amounts to, "how can I write my own profiler?" And the answer is clearly, "don't".

Besides, you should be using std::swap in the first place, which completely invalidates this whole pointless pursuit.

-1 for pointlessness.

John
std::reverse is a wrapper around std::swap...
ufotds
I didn't downvote, but one thing I've learned on SO is, go easy on people. We all come in with different levels of background, and we can share others' wisdom. Clearly you have wisdom to share. That's a good thing about SO.
Mike Dunlavey
Mike: Point taken. You are more patient than I am. That aside, do YOU think this question is valid? I'm quickly learning that sensible questions are rare on here. The optimization questions alone make me worry about what applications these folks are programming. I hope my bank isn't employing programmers to wonder whether they should roll their own std::swap! :)
John
ufotds: So what? Substitute "std::reverse" for "std::swap" in my post. The OP should be using std::whatever instead of worrying about rolling his own, worrying about whether his own is faster, and -- the part that just makes no sense -- avoiding using a profiler.
John
@John: I am amazed that applications work as well as they do, given the silly ideas programmers have on many subjects, including performance.
Mike Dunlavey
+2  A: 

(This answer is specific to Windows XP and the 32-bit VC++ compiler.)

The easiest thing for timing little bits of code is the time-stamp counter of the CPU. This is a 64-bit value, a count of the number of CPU cycles run so far, which is about as fine a resolution as you're going to get. The actual numbers you get aren't especially useful as they stand, but if you average out several runs of various competing approaches then you can compare them that way. The results are a bit noisy, but still valid for comparison purposes.

To read the time-stamp counter, use code like the following:

LARGE_INTEGER tsc;
__asm {
    cpuid
    rdtsc
    mov tsc.LowPart,eax
    mov tsc.HighPart,edx
}

(The cpuid instruction is there to ensure that there aren't any incomplete instructions waiting to complete.)

There are four things worth noting about this approach.

Firstly, because of the inline assembly language, it won't work as-is on MS's x64 compiler. (You'll have to create a .ASM file with a function in it. An exercise for the reader; I don't know the details.)

Secondly, to avoid problems with cycle counters not being in sync across different cores/threads/what have you, you may find it necessary to set your process's affinity so that it only runs on one specific execution unit. (Then again... you may not.)

Thirdly, you'll definitely want to check the generated assembly language to ensure that the compiler is generating roughly the code you expect. Watch out for bits of code being removed, functions being inlined, that sort of thing.

Finally, the results are rather noisy. The cycle counters count cycles spent on everything, including waiting for caches, time spent on running other processes, time spent in the OS itself, etc. Unfortunately, it's not possible (under Windows, at least) to time just your process. So, I suggest running the code under test a lot of times (several tens of thousands) and working out the average. This isn't very cunning, but it seems to have produced useful results for me at any rate.

brone
hi, thanks for this snippet. I doubt its practical value for this purpose, since apparently with WMI it *is* possible to measure only your process, but I pasted it into a simple C++ program and it does work as-is. On top of that, it's the first time I used inline assembler, because my assembler knowledge is rather weak...
ufotds