views: 967

answers: 11

Hi,

<background>

I'm at a point where I really need to optimize some C++ code. I'm writing a library for molecular simulations and I need to add a new feature. I already tried to add this feature in the past, but back then I used virtual functions called in nested loops. I had a bad feeling about that, and the first implementation proved it was indeed a bad idea; however, it was fine for testing the concept.

</background>

Now I need this feature to be as fast as possible (well, without assembly code or GPU computation; this still has to be C++, and more readable rather than less). I now know a bit more about templates and policy classes (from Alexandrescu's excellent book), and I think compile-time code generation may be the solution.

However, I need to test the design before doing the huge work of implementing it in the library. The question is about the best way to test the efficiency of this new feature.

Obviously I need to turn optimizations on, because without them g++ (and probably other compilers as well) would keep unnecessary operations in the object code. I also need to make heavy use of the new feature in the benchmark, because a delta of 1e-3 seconds can make the difference between a good and a bad design (this feature will be called millions of times in the real program).

The problem is that g++ is sometimes "too smart" when optimizing and can remove a whole loop if it considers that the result of a calculation is never used. I've already seen that happen once when looking at the output assembly code.

If I add some printing to stdout, the compiler will be forced to do the calculation in the loop, but then I will probably mostly be benchmarking the iostream implementation.

So how can I do a correct benchmark of a small feature extracted from a library? Related question: is it a correct approach to do this kind of in vitro test on a small unit, or do I need the whole context?

Thanks for any advice!


There seem to be several strategies, from compiler-specific options allowing fine-tuning to more general solutions that should work with every compiler, like volatile or extern.

I think I will try all of these. Thanks a lot for all your answers!

+1  A: 

Unless you have a really aggressive compiler (it can happen), I'd suggest calculating a checksum (simply add all the results together) and outputting the checksum.

Other than that, you might want to look at the generated assembly code before running any benchmarks so you can visually verify that any loops are actually being run.
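
For example, a minimal sketch of the checksum idea (compute() here is just a hypothetical stand-in for the feature under test):

#include <cmath>
#include <cstdio>

// Hypothetical stand-in for the feature being benchmarked.
double compute(int i) { return std::sin(i * 0.001); }

int main()
{
    double checksum = 0.0;
    for (int i = 0; i < 1000000; ++i)
        checksum += compute(i);               // every result feeds the checksum

    std::printf("checksum: %f\n", checksum);  // one output, outside the hot loop
    return 0;
}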

Jasper Bekkers
+1  A: 

Compilers are only allowed to eliminate code branches that cannot be taken. As long as a compiler cannot rule out that a branch will be executed, it will not eliminate it. As long as there is some data dependency somewhere, the code will be there and will be run. Compilers are not very smart about estimating which parts of a program will never run, and they don't try hard, because in general that problem is undecidable and hardly computable even approximately. They have some simple checks, such as for if (0), but that's about it.

My humble opinion is that you were possibly hit by some other problem earlier on, such as the way C/C++ evaluates boolean expressions (short-circuit evaluation).

But anyway, since this is about a test of speed, you can check for yourself that things get called: run it once without instrumentation, then another time with a check of the return values, or with a static variable being incremented on each call. At the end of the test, print out the number generated; if the code really runs in both cases, the timings will be equal.
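
A minimal sketch of the counter variant (feature() here is a hypothetical placeholder for the code under test):

#include <iostream>

static long call_count = 0;   // incremented on every call

double feature(double x)
{
    ++call_count;             // cheap data dependency the compiler must preserve
    return x * 1.000001;
}

int main()
{
    double v = 1.0;
    for (int i = 0; i < 1000000; ++i)
        v = feature(v);

    // printing both values keeps the loop and the counter alive
    std::cout << "calls: " << call_count << ", result: " << v << '\n';
    return 0;
}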

To answer your question about in vitro testing: yes, do that. If your app is that time-critical, do it. On the other hand, your description hints at a different problem: if your deltas are in a timeframe of 1e-3 seconds, that sounds like a problem of computational complexity, since the method in question must be called very, very often (for a few runs, 1e-3 seconds is negligible).

The problem domain you are modeling sounds VERY complex, and the datasets are probably huge. Such things are always an interesting effort. Make sure you absolutely have the right data structures and algorithms first, though, and micro-optimize all you want after that. So I'd say: look at the whole context first. ;-)

Out of curiosity, what is the problem you are calculating?

mstrobl
The deltas were for one function call, since this function will be called a few million times. I'm working in the domain of macromolecular interactions (lots of atoms, translations, rotations...). The new code must modify the coordinates of every atom before any translation/rotation.
ascobol
+1  A: 

You have a lot of control over the optimizations for your compilation. -O1, -O2, and so on are just aliases for a bunch of switches.

From the man pages

       -O2 turns on all optimization flags specified by -O.  It also turns
       on the following optimization flags: -fthread-jumps -falign-functions
       -falign-jumps -falign-loops -falign-labels -fcaller-saves
       -fcrossjumping -fcse-follow-jumps -fcse-skip-blocks
       -fdelete-null-pointer-checks -fexpensive-optimizations -fgcse
       -fgcse-lm -foptimize-sibling-calls -fpeephole2 -fregmove
       -freorder-blocks -freorder-functions -frerun-cse-after-loop
       -fsched-interblock -fsched-spec -fschedule-insns -fschedule-insns2
       -fstrict-aliasing -fstrict-overflow -ftree-pre -ftree-vrp

You can tweak and use this command to help you narrow down which options to investigate.

       ...
       Alternatively you can discover which binary optimizations are
       enabled by -O3 by using:

               gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
               gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
                diff /tmp/O2-opts /tmp/O3-opts | grep enabled

Once you find the culprit optimization, you shouldn't need the couts.
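
For example, you could keep -O2 but switch off one suspect pass at a time (the particular flag here is only an illustration, taken from the -O2 list above):

    g++ -O2 -fno-tree-pre benchmark.cpp -o benchmark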

J.J.
The problem is that exactly this optimization might be desirable in another place, or even crucial for the overall performance of the template.
Konrad Rudolph
+1  A: 

If this is possible for you, you might try splitting your code into:

  • the library you want to test, compiled with all optimizations turned on
  • a test program, dynamically linking the library, with optimizations turned off

Otherwise, you might specify a different optimization level (it looks like you're using gcc...) for the test function with the optimize attribute (see http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes).
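
A minimal sketch using that attribute (run_benchmark() is a hypothetical driver function; the attribute needs a sufficiently recent gcc):

// Compiled at -O0 even when the rest of the file uses -O2/-O3,
// so calls into the optimized library cannot be elided here.
__attribute__((optimize("O0")))
void run_benchmark()
{
    // ... time calls into the fully optimized library ...
}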

Paolo Tedesco
A: 

I don't know if GCC has a similar feature, but with VC++ you can use:

#pragma optimize

to selectively turn optimizations on/off. If GCC has similar capabilities, you could build with full optimization and just turn it off where necessary to make sure your code gets called.
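
In VC++ that looks something like this (a sketch; benchmark_driver() is a hypothetical name):

#pragma optimize("", off)   // disable all optimizations for the functions below
void benchmark_driver()
{
    // ... calls into the fully optimized parts of the program ...
}
#pragma optimize("", on)    // restore the command-line optimization settings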

Ferruccio
+1  A: 

You could create a dummy function in a separate .cpp file that does nothing, but takes as its argument whatever the type of your calculation result is. Then you can call that function with the result of your calculation, forcing gcc to generate the intermediate code, and the only penalty is the cost of the function call (which shouldn't skew your results unless you call it a lot!).
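
A sketch of that setup, assuming the result type is double (the file names and the consume() function are illustrative):

// sink.cpp -- compiled separately; the optimizer cannot see the empty body
void consume(double) {}

// benchmark.cpp
void consume(double);   // opaque call: the compiler must materialize the argument

int main()
{
    double result = 0.0;
    for (int i = 0; i < 1000000; ++i)
        result += i * 0.5;   // stand-in for the real calculation
    consume(result);         // forces the loop's result to be computed
    return 0;
}

As long as the two files are not compiled with link-time optimization, the call stays opaque and the computation cannot be removed.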

Tony
+1 - The bottom line: find the least-impacting piece of code you can add to your benchmarking loop to trick the compiler into believing it's actually doing something. For example, add the iteration index etc. to a dummy checksum variable that gets output or passed up.
Ates Goral
Isn't this shifting the problem? Couldn't the compiler still decide that something is irrelevant to the computation of the return value and optimize it away?
Paolo Tedesco
A: 

Just a small example of an unwanted optimization:

#include <vector>
#include <iostream>

using namespace std;

int main()
{
    double coords[500][3];

    // perform a simple initialization of all coordinates:
    for (int i = 0; i < 500; ++i)
    {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }

    cout << "hello world !" << endl;
    return 0;
}

If you comment out the code from double coords[500][3] to the end of the for loop, it will generate exactly the same assembly code (just tried with g++ 4.3.2). I know this example is far too simple, and I wasn't able to reproduce this behavior with a std::vector of a simple "Coordinates" structure.

However, I think this example still shows that some optimizations can introduce errors into a benchmark, and I wanted to avoid surprises of this kind when introducing new code into a library. It's easy to imagine that the new context might prevent some optimizations and lead to a very inefficient library.

The same should also apply to virtual functions (but I don't prove it here). Used in a context where a statically bound call would do the job, I'm pretty confident that decent compilers will eliminate the extra indirect call for the virtual function. I could try this call in a loop and conclude that calling a virtual function is not such a big deal; then I'd call it hundreds of thousands of times in a context where the compiler cannot guess the exact type of the pointer, and get a 20% increase in running time...

ascobol
Looks like a good optimization to me! coords[] is never used and has no side effects when assigned to. Just because the compiler is smarter than you doesn't mean you should complain. (It will NEVER remove stuff that you need, use, or that has a side effect.)
Martin York
Of course it's doing a good job! I wanted to show how difficult it is to try a feature out of context: if I reduce the problem to the smallest thing I want to test, the compiler may do nothing at all, or let me hope for some optimizations that it won't be able to do in real life.
ascobol
+1  A: 
#include <iostream>

// Mark coords as extern.
// The compiler is now NOT allowed to optimise away coords,
// thus it cannot remove the loop where you initialise it,
// because the array could be used by another compilation unit.
extern double coords[500][3];
double coords[500][3];

int main()
{
    // perform a simple initialization of all coordinates:
    for (int i = 0; i < 500; ++i)
    {
        coords[i][0] = 3.23;
        coords[i][1] = 1.345;
        coords[i][2] = 123.998;
    }

    std::cout << "hello world !" << std::endl;
    return 0;
}
Martin York
A: 

At startup, read a value from a file. Then, in your code, write something like if (input == "x") cout << result_of_benchmark;

The compiler will not be able to eliminate the calculation, and if you ensure the input is never "x", you won't benchmark the iostream.
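
A sketch of the trick (using a command-line argument instead of a file, for brevity):

#include <iostream>
#include <string>

int main(int argc, char* argv[])
{
    std::string input = argc > 1 ? argv[1] : "";  // unknown at compile time

    double result = 0.0;
    for (int i = 0; i < 1000000; ++i)
        result += i * 0.5;   // stand-in for the benchmarked calculation

    if (input == "x")        // the compiler cannot prove this branch dead
        std::cout << result << '\n';
    return 0;
}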

+1  A: 

Edit: the easiest thing you can do is simply use the data in some spurious way after the function has run and outside your benchmark timing. For example:

StartBenchmarking(); // i.e., read a performance counter
for (int i = 0; i < 500; ++i)
{
    coords[i][0] = 3.23;
    coords[i][1] = 1.345;
    coords[i][2] = 123.998;
}
StopBenchmarking(); // what comes after this won't go into the timer

// this is just to force the compiler to use coords
double foo = 0.0;   // the accumulator must start from zero
for (int j = 0; j < 500; ++j)
{
    foo += coords[j][0] + coords[j][1] + coords[j][2];
}
cout << foo;


What sometimes works for me in these cases is to hide the in vitro test inside a function and pass the benchmark data sets through volatile pointers. This tells the compiler that it must not collapse subsequent writes to those pointers (because they might be, e.g., memory-mapped I/O). So:

void test1( volatile double *coords )
{
    // perform a simple initialization of all coordinates
    // (flattened layout: 500 points x 3 components = 1500 doubles):
    for (int i = 0; i < 1500; i += 3)
    {
        coords[i+0] = 3.23;
        coords[i+1] = 1.345;
        coords[i+2] = 123.998;
    }
}

For some reason I haven't figured out yet, it doesn't always work in MSVC, but it often does; look at the assembly output to be sure. Also remember that volatile will foil some compiler optimizations (it forbids the compiler from keeping the pointer's contents in a register and forces writes to occur in program order), so this is only trustworthy if you're using it for the final write-out of data.

In general in vitro testing like this is very useful so long as you remember that it is not the whole story. I usually test my new math routines in isolation like this so that I can quickly iterate on just the cache and pipeline characteristics of my algorithm on consistent data.

The difference between test-tube profiling like this and running it in "the real world" is that you will get wildly varying input data sets (sometimes best case, sometimes worst case, sometimes pathological), the cache will be in some unknown state on entering the function, and you may have other threads banging on the bus; so you should run some benchmarks on this function in vivo as well when you are finished.

Crashworks
+2  A: 

If you want to force any compiler to not discard a result, have it write the result to a volatile object. That operation cannot be optimized out, by definition.

template<typename T> void sink(T const& t) {
   volatile T sinkhole = t;   // a volatile write can never be optimized out
}

No iostream overhead, just a copy that has to remain in the generated code. Now, if you're collecting results from a lot of operations, it's best not to discard them one by one; those copies can still add some overhead. Instead, collect all the results in a single non-volatile object (so every individual result is needed) and then assign that object to a volatile one. E.g. if your individual operations all produce strings, you can force evaluation by adding all the char values together modulo 1<<32. This adds hardly any overhead; the strings will likely be in cache. The result of the addition is subsequently assigned to a volatile, so each char in each string must in fact be calculated, with no shortcuts allowed.
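
For instance, for string-producing operations the collection step might look like this (a sketch; make_string() is a hypothetical stand-in for the real operation):

#include <string>

template<typename T> void sink(T const& t) {
   volatile T sinkhole = t;   // the volatile write cannot be removed
}

// Hypothetical stand-in for the string-producing operation under test.
std::string make_string(int i) { return std::string(8, static_cast<char>('a' + i % 26)); }

void benchmark()
{
    unsigned sum = 0;   // unsigned addition wraps, i.e. works modulo 2^32 on typical platforms
    for (int i = 0; i < 100000; ++i) {
        std::string s = make_string(i);
        for (std::string::size_type j = 0; j < s.size(); ++j)
            sum += static_cast<unsigned char>(s[j]);   // every char feeds the sum
    }
    sink(sum);   // a single volatile write at the end forces the whole chain
}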

MSalters