views:

579

answers:

8

Say I have a virtual function call foo() on an abstract base class pointer, mypointer->foo(). When my app starts up, based on the contents of a file, it chooses to instantiate a particular concrete class and assigns mypointer to that instance. For the rest of the app's life, mypointer will always point to objects of that concrete type. I have no way to know what this concrete type is (it may be instantiated by a factory in a dynamically loaded library). I only know that the type will stay the same after the first time an instance of the concrete type is made. The pointer may not always point to the same object, but the object will always be of the same concrete type. Notice that the type is technically determined at 'runtime' because it's based on the contents of a file, but that after 'startup' (file is loaded) the type is fixed.

However, in C++ I pay the virtual function lookup cost every time foo is called for the entire duration of the app. The compiler can't optimize the look up away because there's no way for it to know that the concrete type won't vary at runtime (even if it was the most amazing compiler ever, it can't speculate on the behavior of dynamically loaded libraries). In a JIT compiled language like Java or .NET the JIT can detect that the same type is being used over and over and do inline cacheing. I'm basically looking for a way to manually do that for specific pointers in C++.

Is there any way in C++ to cache this lookup? I realize that solutions might be pretty hackish. I'm willing to accept ABI/compiler specific hacks if it's possible to write configure tests that discover the relevant aspects of the ABI/compiler so that it's "practically portable" even if not truly portable.

Update: To the naysayers: If this wasn't worth optimizing, then I doubt modern JITs would do it. Do you think Sun and MS's engineers were wasting their time implementing inline cacheing, and didn't benchmark it to ensure there was an improvement?

+4  A: 

So assuming that this is a fundamental issue you want to solve (to avoid premature optimization arguments), and ignoring platform and compiler specific hackery, you can do one of two things, at opposite ends of complexity:

  1. Provide a function as part of the .dll that internally simply calls the right member function directly. You pay the cost of an indirect jump, but at least you don't pay the cost of a vtable lookup. Your mileage may vary, but on certain platforms, you can optimize the indirect function call.
  2. Restructure your application such that instead of calling a member function per instance, you call a single function that takes a collection of instances. Mike Acton has a wonderful post (with a particular platform and application type bent) on why and how you should do this.
MSN
+17  A: 

There are two costs to a virtual function call: The vtable lookup and the function call.

The vtable lookup is already taken care of by the hardware. Modern CPUs (assuming you're not working on a very simple embedded CPU) will predict the address of the virtual function in their branch predictor and speculatively execute it in parallel with the array lookup. The fact that the vtable lookup happens in parallel with the speculative execution of the function means that, when executed in a loop in the situations you describe, virtual function calls have next to zero overhead compared to direct, non-inlined function calls.

I've actually tested this in the past, albeit in the D programming language, not C++. When inlining was disabled in the compiler settings and I called the same function in a loop several million times, the timings were within epsilon of each other whether the function was virtual or not.

The second and more important cost of virtual functions is that they prevent inlining of the function in most cases. This is even more important than it sounds because inlining is an optimization that can enable several other optimizations such as constant folding in some cases. There's no way to inline a function without recompiling the code. JITs get around this because they're constantly recompiling code during the execution of your application.

dsimcha
Do we even have to worry about the vtable lookup in a loop situation? I'm thinking that in a loop, where the object pointer doesn't change, won't the compiler optimize the vtable lookup out of the loop? If the object pointer doesn't change, then the object (and it's type and vtable) can't change, so the results of the vtable lookup can't change. This isn't an app-wide optimization, but if it's doing it, then each loop will only have to do the lookup once, which for most apps should be more than good enough.
Michael Kohne
@Michael Kohne: I can't speak for every compiler, but based on reading disassemblies from the Digital Mars D compiler, it doesn't look like this happens. In theory, one could put the function pointer in a register, etc. I actually hacked together some assembly language code to do this one time and it wasn't any faster, probably because with speculative execution you're effectively not paying for the array lookup anyhow.
dsimcha
In all other cases, the vtable will stay cached and the cost of going into the cache is next to nothing. Instead of trying to optimize cache hits, you should focus on cache misses. Every single cache miss will block the CPU for hundreds of cycles.
edgar.holleis
Your benchmark was not realistic. How well does speculative execution fare when the branch predictor actually has other things to worry about? When we're several layers of conditionally indirect calls deep? Also, your answer technically doesn't address the question.
Joseph Garvin
@Joseph Garvin: If you've got that much other logic between executions of a given call (e.g. if the call is in an outer loop, not an inner loop, or if the function you're calling calls several other functions) then that virtual function call is **not going to be a practically important bottleneck** even if it's mispredicted every time.
dsimcha
My argument is that if you have virtual calls making virtual calls, then at some point the constant overhead has to sum to a significant amount. I can see this happening in a dependency injection framework, where one injected piece call a virtual method on another injected piece, ad nauseum. After some depth you have to exhaust the speculators ability to execute things in parallel.
Joseph Garvin
@dsimcha: Why do modern VMs do this if it is never an optimization?
Joseph Garvin
@Joseph Garvin: Probably because they also want to inline the function.
dsimcha
@dsimchar, you said that you called *the same function* in a loop. In this case, even if the function is virtual, CPU can perfectly predict the branch target. So, you saw very small difference. However, if the target is hard to predict, then it is quite different.
minjang
@minjang: Right, if the target is hard to predict, you get different results. However, unpredictable branches of any kind are a nightmare for micro-optimization, and usually represent essential complexity in the code. The OP's question implied that the branches were predictable.
dsimcha
@dsimchar, even the targets are easy to predict, modern CPUs may not handle well for indirect branches. Direct branches are quite well predict. However, if the targets are correlated with some patterns, then it's still hard to predict. And it is still on-going work for CPU architects.
minjang
+2  A: 

I have seen situations where avoiding a virtual function call is beneficial. This does not look to me to be one of those cases because you really are using the function polymorphically. You are just chasing one extra address indirection, not a huge hit, and one that might be partially optimized away in some situations. If it really does matter, you may want to restructure your code so that type-dependent choices such as virtual function calls are made fewer times, pulled outside of loops.

If you really think it's worth giving it a shot, you can set a separate function pointer to a non-virtual function specific to the class. I might (but probably wouldn't) consider doing it this way.

class MyConcrete : public MyBase
{
public:
  static void foo_nonvirtual(MyBase* obj);
  virtual void foo()
  { foo_nonvirtual(this); }
};

void (*f_ptr)(MyBase* obj) = &MyConcrete::foo_nonvirtual;
// Call f_ptr instead of obj->foo() in your code.
// Still not as good a solution as restructuring the algorithm.

Other than making the algorithm itself a bit wiser, I suspect any attempt to manually optimize the virtual function call will cause more problems than it solves.

Darryl
"This does not look to me to be one of those cases because you really are using the function polymorphically." <-- Sort of. It's polymorphic until startup is over, but monomorphic after that.
Joseph Garvin
+2  A: 

All answers are dealing with the most simple scenario, where calling a virtual method only requires getting the address of the actual method to call. In the general case, when multiple and virtual inheritance come into play, calling a virtual method requires shifting the this pointer.

The method dispatch mechanism can be implemented in more than one way, but it is common to find that the entry in the virtual table is not the actual method to call, but rather some intermediate 'trampoline' code inserted by the compiler that relocates the this pointer prior to calling the actual method.

When the dispatch is the simplest, just an extra pointer redirection, then trying to optimize it does not make sense. When the problem is more complex, then any solution will be compiler dependent and hackerish. Moreover, you do not even know in what scenario you are: if the objects are loaded from dlls then you don't really know whether the actual instance returned belongs to a simple linear inheritance hierarchy or a more complex scenario.

David Rodríguez - dribeas
+4  A: 

Why virtual call is expensive? Because you simply don't know the branch target until the code is executed in runtime. Even modern CPUs are still perfectly handling the virtual call and indirect calls. One can't simply say it costs nothing because we just have a faster CPU. No, it is not.

1. How can we make it fast?

You already have pretty deep understanding the problem. But, the only I can say that if the virtual function call is easy to predict, then you could perform software-level optimization. But, if it's not (i.e., you have really no idea what would be the target of the virtual function), then I don't think that there is good solution for now. Even for CPU, it is hard to predict in such extreme case.

Actually, compilers such as Visual C++'s PGO(Profiling guided optimization) has virtual call speculation optimization (Link). If the profiling result can enumerate hot virtual function targets, then it translate to direct call which can be inlined. This is also called devirtualization. It can be also found in some Java dynamic optimizer.

2. To those one who say it's not necessary

If you're using script languages, C# and concern about the coding efficiency, yes, it's worthless. However, anyone who are eager to save a single cycle to obtain better performance, then indirect branch is still important problem. Even the latest CPUs are not good to handle virtual calls. One good example would be a virtual machine or interpreter, which usually have a very large switch-case. Its performance is pretty much related to the correct prediction of indirect branch. So, you can't simply say it's too low-level or not necessary. There are hundreds of people who are trying to improve the performance in the bottom. That's why you can simply ignore such details :)

3. Some boring computer architectural facts related to virtual functions

dsimcha has written a good answer for how CPU can handle virtual call effectively. But, it's not exactly correct. First, all modern CPUs have branch predictor, which literally predicts the outcomes of a branch to increase pipeline throughput (or, more parallelism in instruction level, or ILP. I can even say that single-thread CPU performance is solely depending on how much you can extract ILP from a single thread. Branch prediction is the most critical factor for obtaining higher ILP).

In branch prediction, there are two predictions: (1) direction (i.e., the branch is taken? or not taken? binary answer), and (2) branch target (i.e., where will I go? it's not binary answer). Based on the prediction, CPU speculatively execute the code. If the speculation is not correct, then CPU rollbacks and restarts from the mis-predicted branch. This is completely hidden from programmer's view. So, you don't really know what's going on inside the CPU unless you're profiling with VTune which gives branch misprediction rates.

In general, branch direction prediction is highly accurate(95%+), but it is still hard to predict branch targets, especially virtual calls and switch-case(i.e., jump table). Vrtual call is indirect branch which requires a more memory load, and also CPU requires branch target prediction. Modern CPUs like Intel's Nehalem and AMD's Phenom have specialized indirect branch target table.

However, I don't think looking up vtable incurs a lot of overhead. Yes, it requires a more memory load which can make cache miss. But, once vtable is loaded into cache, then it's pretty much cache hit. If you're also concerned with that cost, you may put prefetching code to load vtable in advance. But, the real difficulty of virtual function call is that CPU can't do great job to predict the target of virtual call, which may result in pipeline drain frequently due to misprediction of the target.

minjang
+1  A: 

Could you use a method pointer?

The objective here is that the compiler would load the pointer with the location of the resolved method or function. This would occur once. After the assignment, the code would access the method in a more direct fashion.

I know that a pointer to an object and accessing the method via the object point invokes run-time polymorphism. However, there should be a way to load a method pointer to a resolved method, avoiding the polymorphism and directly calling the function.

I've checked the community wiki to introduce more discussion.

Thomas Matthews
This has the same problem that most other answers: the compiler does not only need to determine the actual method (block of code) to call, but also modify the this pointer accordingly. The scenario is not as simple as most people consider here.
David Rodríguez - dribeas
+1  A: 

You can't use a method pointer because pointers to member functions aren't considered covariant return types. See the example below:

#include <iostream>

struct base;
struct der;

typedef void(base::*pt2base)();
typedef void(der::*pt2der)();

struct base {
    virtual pt2base method() = 0;
    virtual void testmethod() = 0;
    virtual ~base() {}
};

struct der : base {
    void testmethod() {
        std::cout << "Hello from der" << std::endl;
    }
    pt2der method() { **// this is invalid because pt2der isn't a covariant of pt2base**
        return &der::testmethod;
    }
};

The other option would be to have the method declared pt2base method() but then the return would be invalid because der::testmethod is not of type pt2base.

Also even if you had a method that received a ptr or reference to the base type you would have to dynamically cast it to the derived type in that method to do anything particularly polymorphic which adds back in the cost we're trying to save.

DerrickH
Wow, I'm still learning C++ esoterica o_O
Joseph Garvin
+1  A: 

So, what you basically want to do is convert runtime polymorphism into compile time polymorphism. Now you still need to build your app so that it can handle multiple "cases", but once it's decided which case is applicable to a run, that's it for the duration.

Here's a model of the runtime polymorphism case:

struct Base {
  virtual void doit(int&)=0;
};

struct Foo : public Base {
  virtual void doit(int& n) {--n;}
};

struct Bar : public Base {
  virtual void doit(int& n) {++n;}
};

void work(Base* it,int& n) {
  for (unsigned int i=0;i<4000000000u;i++) it->doit(n);
}

int main(int argc,char**) {
  int n=0;

  if (argc>1)
    work(new Foo,n);
  else
    work(new Bar,n);

  return n;
}

This takes ~14s to execute on my Core2, compiled with gcc 4.3.2 (32 bit Debian), -O3 option.

Now suppose we replace the "work" version with a templated version (templated on the concrete type it's going to be working on):

template <typename T> void work(T* it,int& n) {
  for (unsigned int i=0;i<4000000000u;i++) it->T::doit(n);
}

main doesn't actually need to be updated, but note that the 2 calls to work now trigger instantiations of and calls to two different and type-specific functions (c.f the one polymorphic function previously).

Hey presto runs in 0.001s. Not a bad speed up factor for a 2 line change! However, note that the massive speed up is entirely due to the compiler, once the possibility of runtime polymorphism in the work function is eliminated, just optimizing away the loop and compiling the result directly into the code. But that actually makes an important point: in my experience the main gains from using this sort of trick come from the opportunities for improved inlining and optimisation they allow the compiler when a less-polymorphic, more specific function is generated, not from the mere removal of vtable indirection (which really is very cheap).

But I really don't recommend doing stuff like this unless profiling absolutely indicates runtime polymorphism is really hitting your performance. It'll also bite you as soon as someone subclasses Foo or Bar and tries to pass that into a function actually intended for its base.

You might find this related question interesting too.

timday