views: 1318 · answers: 15

In an AI application I am writing in C++,

  1. there is not much numerical computation
  2. there are lot of structures for which run-time polymorphism is needed
  3. very often, several polymorphic structures interact during computation

In such a situation, are there any optimization techniques? While I won't optimize the application just yet, one reason for selecting C++ over Java was the extra leverage it offers for optimization, and the ability to use non-object-oriented methods (templates, procedures, overloading).

In particular, what are the optimization techniques related to virtual functions? Virtual functions are implemented through virtual tables in memory. Is there some way to pre-fetch these virtual tables into the L2 cache (since the relative cost of fetching from main memory versus L2 cache keeps growing)?

Apart from this, are there good references for data locality techniques in C++? These techniques would reduce the wait time for data fetch into L2 cache needed for computation.

Update: Also see the following related questions: Performance Penalty for Interface, Several Levels of Base Classes

+1  A: 

You rarely have to worry about cache in regards to such commonly used items, since they're fetched once and kept there.

Cache is only generally an issue when dealing with large data structures that either:

  1. Are large enough and used for a very long time by a single function so that function can push everything else you need out of the cache, or
  2. Are randomly accessed enough that the data structures themselves aren't necessarily in cache when you load from them.

Things like Vtables are generally not going to be a performance/cache/memory issue; usually there's only one Vtable per object type, and the object contains a pointer to the Vtable instead of the Vtable itself. So unless you have a few thousand types of objects, I don't think Vtables are going to thrash your cache.
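As a quick illustration of that layout (a sketch; exact sizes are implementation-defined, so this only checks relative sizes):

```cpp
#include <cstddef>

// A type with no virtual functions: no hidden members.
struct Plain { int x; };

// A polymorphic type: the compiler adds one hidden pointer to the
// (single, statically allocated) vtable shared by all instances.
struct Polymorphic {
    virtual ~Polymorphic() {}
    int x;
};

// The per-object cost of polymorphism is just that one pointer.
constexpr std::size_t plain_size = sizeof(Plain);
constexpr std::size_t poly_size  = sizeof(Polymorphic);
```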

1), by the way, is why functions like memcpy use cache-bypassing streaming instructions like movnt(dq|q) for extremely large (multi-megabyte) data inputs.

Dark Shikari
A: 

If an AI application does not require a great deal of number crunching, I wouldn't worry about the performance disadvantage of virtual functions. There will be a marginal performance hit only if they appear in complex computations that are evaluated repeatedly. I don't think you can force the virtual table to stay in L2 cache either.

There are a couple of optimizations available for virtual functions,

  1. People have written compilers that devirtualize calls through whole-program analysis and transformation, but these aren't production-grade compilers.
  2. You could replace all virtual functions with equivalent "switch...case" blocks that call the appropriate function based on the type in the hierarchy. This way you get rid of the compiler-managed virtual table and have your own virtual table in the form of a switch...case block. Now the chances of your own virtual table being in the L2 cache are high, since it is in the code path. Remember, you'll need RTTI or your own "typeof" function to achieve this.
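A minimal sketch of that switch-based dispatch, using a hand-rolled type tag instead of RTTI (all names here are illustrative):

```cpp
// A hand-rolled type tag plays the role of "typeof"; dispatch goes
// through a switch instead of a compiler-generated vtable, so the
// jump targets sit directly in the code path.
enum class NodeKind { Leaf, Branch };

struct Node {
    NodeKind kind;  // our own type tag, set on construction
    int value;
};

int evaluate(const Node& n) {
    switch (n.kind) {
        case NodeKind::Leaf:   return n.value;      // "Leaf::evaluate"
        case NodeKind::Branch: return 2 * n.value;  // "Branch::evaluate"
    }
    return 0;  // unreachable for well-formed tags
}
```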
nikhilbelsare
+4  A: 

Have you actually profiled and found where, and what needs optimization?

Work on actually optimizing virtual function calls when you have found they actually are the bottleneck.

Chris Mayer
+22  A: 

Virtual functions are very efficient. Assuming 32-bit pointers, the memory layout is approximately:

classptr -> [vtable:4][classdata:x]
vtable -> [first:4][second:4][third:4][fourth:4][...]
first -> [code:x]
second -> [code:x]
...

The classptr points to memory that is typically on the heap, occasionally on the stack, and starts with a four-byte pointer to the vtable for that class. But the important thing to remember is that the vtable itself is not allocated per object. It's a static resource, and all objects of the same class type will point to exactly the same memory location for their vtable array. Calling on different instances won't pull different memory locations into L2 cache.

This example from msdn shows the vtable for class A with virtual func1, func2, and func3. Nothing more than 12 bytes. There is a good chance the vtables of different classes will also be physically adjacent in the compiled library (you'll want to verify this if you're especially concerned), which could increase cache efficiency microscopically.

CONST SEGMENT
??_7A@@6B@
   DD  FLAT:?func1@A@@UAEXXZ
   DD  FLAT:?func2@A@@UAEXXZ
   DD  FLAT:?func3@A@@UAEXXZ
CONST ENDS

The other performance concern would be instruction overhead of calling through a vtable function. This is also very efficient. Nearly identical to calling a non-virtual function. Again from the example from msdn:

; A* pa;
; pa->func3();
mov eax, DWORD PTR _pa$[ebp]
mov edx, DWORD PTR [eax]
mov ecx, DWORD PTR _pa$[ebp]
call  DWORD PTR [edx+8]

In this example ebp, the stack frame base pointer, has the variable A* pa at zero offset. The register eax is loaded with the value at location [ebp], so it has the A*, and edx is loaded with the value at location [eax], so it has class A vtable. Then ecx is loaded with [ebp], because ecx represents "this" it now holds the A*, and finally the call is made to the value at location [edx+8] which is the third function address in the vtable.

If this function call were not virtual, the mov eax and mov edx would not be needed, but the difference in performance would be immeasurably small.

loudej
Nice example. Does anyone else get itchy from the redundant memory hit right before the call when a mov ecx, eax would do it?
plinth
On a modern CPU the movs might be completely pipelined away.
Jimmy J
It's important to note that while the instruction overhead of calling a virtual function is minimal, virtual functions can negatively affect performance in other ways. For one thing, they defeat optimizations such as inlining and hinder branch prediction. For another, they can harm I-cache performance.
Andrew Grant
And what about the pipeline bubble created by a branch mispredict on an indirect jump?
Crashworks
+1  A: 

You can implement polymorphism at run time using virtual functions, and at compile time using templates. You can replace virtual functions with templates. Take a look at this article for more information - http://www.codeproject.com/KB/cpp/SimulationofVirtualFunc.aspx
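A small sketch of the template approach (Dog, Cat, and greet are hypothetical names for illustration):

```cpp
#include <string>

// Compile-time polymorphism: the variation point is a template
// parameter, so every call is statically bound and can be inlined.
struct Dog { std::string speak() const { return "woof"; } };
struct Cat { std::string speak() const { return "meow"; } };

template <typename Animal>
std::string greet(const Animal& a) {
    return a.speak();  // resolved at compile time; no vtable lookup
}
```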

+1  A: 

An alternative to dynamic polymorphism is static polymorphism, usable if your types are known at compile time: the CRTP (curiously recurring template pattern).

http://en.wikipedia.org/wiki/Curiously_recurring_template_pattern

The explanation on Wikipedia is clear enough, and it could help you if you really have determined that virtual method calls are the source of performance bottlenecks.
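A minimal CRTP sketch (Shape, Square, and area_impl are hypothetical names):

```cpp
// CRTP: the base class is parameterised on the derived type, so the
// "override" is resolved at compile time and can be inlined.
template <typename Derived>
struct Shape {
    int area() const {
        // static_cast replaces the vtable lookup of a virtual call
        return static_cast<const Derived*>(this)->area_impl();
    }
};

struct Square : Shape<Square> {
    int side;
    explicit Square(int s) : side(s) {}
    int area_impl() const { return side * side; }
};
```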

paercebal
+2  A: 

Virtual calls do not add much overhead over normal function calls. The greatest loss is that a virtual function, when called polymorphically, cannot be inlined, and inlining will in many situations represent a real gain in performance.

Something you can do to avoid losing that facility in some situations is to declare the function inline virtual.

class A {
   inline virtual int foo() {...}
};

And when you are at a point in the code where you are SURE about the type of the object being called, you may make a qualified call that bypasses the polymorphic dispatch and enables inlining by the compiler.

class B : public A {
    inline virtual int foo()
    {
        // ...do something different
    }

    void bar()
    {
        // logic...
        B::foo();
        // more logic
    }
};

In this example the call to foo will be made non-polymorphic and bound to B's implementation of foo. But do it only when you know for sure what the instance type is, because the automatic polymorphism feature will be gone, and this is not very obvious to later readers of the code.
OldMan
+4  A: 

The only optimization I can think of is Java's JIT compiler. If I understand it correctly, it monitors the calls as the code runs, and if most calls go to one particular implementation only, it inserts a conditional jump to that implementation when the class is right. This way, most of the time, there is no vtable lookup. Of course, for the rare case when we pass a different class, the vtable is still used.

I am not aware of any C++ compiler/runtime that uses this technique.
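Nothing stops you from writing that guard by hand in C++, though; a hedged sketch of the idea, with types invented for illustration:

```cpp
struct Base {
    virtual ~Base() {}
    virtual int f() const { return 1; }
};

struct Common : Base {
    int f() const override { return 2; }
};

// Hand-written version of the JIT trick: test for the expected dynamic
// type first and make a direct (inlinable) call; fall back to ordinary
// virtual dispatch for everything else.
int call_f(const Base& b) {
    if (const Common* c = dynamic_cast<const Common*>(&b))
        return c->Common::f();  // non-virtual, statically bound call
    return b.f();               // rare path: normal vtable dispatch
}
```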

Arkadiy
This is *very* important for Java where every method is virtual by default. C++ has the programmer do his own profiling and direct call fixes.
Zan Lynx
+6  A: 

Section 5.3.3 of the draft Technical Report on C++ Performance is entirely devoted to the overhead of virtual functions.

Xavier Nodet
+1  A: 

On recent CPUs the cost is more or less the same as for normal functions, but virtual functions can't be inlined. If you call a function millions of times, the impact can be significant (try calling the same function millions of times, once inlined and once not, and you will see it can be twice as slow if the function itself does something simple; this is not a theoretical case: it is quite common in numerical computation).

David Cournapeau
+3  A: 

I'm reinforcing all answers that say in effect:

  • If you don't actually know it's a problem, any concern about fixing it is probably misplaced.

What you want to know is:

  • What fraction of execution time (when it's actually running) is spent in the process of invoking methods, and in particular, which methods are the most costly (by this measure).

Some profilers can give you this information indirectly. They need to summarize at the statement level, but exclusive of the time spent in the method itself.

My favorite technique is to just pause it a number of times under a debugger.

If the time spent in the process of virtual function invocations is significant, like say 20%, then on the average 1 out of 5 samples will show, at the bottom of the call stack, in the disassembly window, the instructions for following the virtual function pointer.

If you don't actually see that, it is not a problem.

In the process, you will probably see other things higher up the call stack, that actually are not needed and could save you a lot of time.

Mike Dunlavey
+3  A: 

A virtual function call tends to be a table lookup plus an indirect function call. On some platforms this is fast. On others, e.g., one popular PPC architecture used in consoles, this isn't so fast.

Optimizations usually revolve around expressing variability higher up in the callstack so that you don't need to invoke a virtual function multiple times within hotspots.
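One common shape of that optimization is to dispatch once per batch instead of once per element; a sketch with invented names:

```cpp
#include <vector>

struct Filter {
    virtual ~Filter() {}
    virtual int apply(int x) const = 0;
    // Default batch method: still one virtual dispatch per element.
    virtual void apply_all(std::vector<int>& v) const {
        for (int& x : v) x = apply(x);
    }
};

struct DoubleFilter : Filter {
    int apply(int x) const override { return 2 * x; }
    // The variability is expressed once, above the hot loop:
    void apply_all(std::vector<int>& v) const override {
        for (int& x : v) x = 2 * x;  // no dispatch inside the loop
    }
};
```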

MSN
+2  A: 

As already stated by the other answers, the actual overhead of a virtual function call is fairly small. It may make a difference in a tight loop where it is called millions of times per second, but it's rarely a big deal.

However, it may still have a bigger impact in that it's harder for the compiler to optimize. It can't inline the function call, because it doesn't know at compile-time which function will be called. That also makes some global optimizations harder. And how much performance does this cost you? It depends. It is usually nothing to worry about, but there are cases where it may mean a significant performance hit.

And of course it also depends on the CPU architecture. On some, it can become quite expensive.

But it's worth keeping in mind that any kind of runtime polymorphism carries more or less the same overhead. Implementing the same functionality via switch statements or similar, to select between a number of possible functions, may not be cheaper.

The only reliable way to optimize this would be if you could move some of the work to compile-time. If it is possible to implement part of it as static polymorphism, some speedup may be possible.

But first, make sure you have a problem. Is the code actually too slow to be acceptable? Second, find out what makes it slow through a profiler. And third, fix it.

jalf
I would only add that optimization is way over-rated. It only matters in code where 1) the PC spends significant time (which it probably won't if it contains calls), and 2) you actually compile (i.e. not library routines).
Mike Dunlavey
+1  A: 

With modern look-ahead, multiple-issue CPUs, the overhead for a virtual function call might well be zero. Nada. Zip.

Jimmy J
It would be great if you could give a detailed example.
Amit Kumar
+1  A: 

Static polymorphism, as some users answered here. For example, WTL uses this method. A clear explanation of the WTL implementation can be found at http://www.codeproject.com/KB/wtl/wtl4mfc1.aspx#atltemplates

Hernán