views:

238

answers:

6

Hello everybody, I'm trying to benchmark the difference between a function pointer call and a virtual function call. To do this, I have written two pieces of code that do the same mathematical computation over an array. One variant uses an array of pointers to functions and calls those in a loop. The other variant uses an array of pointers to a base class and calls its virtual function, which is overridden in the derived classes to do exactly the same thing as the functions in the first variant. Then I print the elapsed time and use a simple shell script to run the benchmark many times and compute the average run time.

Here is the code:

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

using namespace std;

long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    // The LL suffix forces 64-bit arithmetic; on systems with a 32-bit
    // time_t, tv_sec * 1000000000 would otherwise overflow.
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
        ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}

void function_not( double *d ) {
    *d = sin(*d);
}

void function_and( double *d ) {
    *d = cos(*d);
}

void function_or( double *d ) {
    *d = tan(*d);
}

void function_xor( double *d ) {
    *d = sqrt(*d);
}

void ( * const function_table[4] )( double* ) = { &function_not, &function_and, &function_or, &function_xor };

int main(void)
{
    srand(time(0));
    void ( * index_array[100000] )( double * );
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i] = function_table[ rand() % 4 ];
        array[i] = ( double )( rand() / 1000 );
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i]( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}

and here is the virtual function variant:

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

using namespace std;

long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    // As above: the LL suffix avoids overflow with a 32-bit time_t.
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
        ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}

class A {
public:
    virtual ~A() {}
    virtual void calculate( double *i ) = 0;
};

class A1 : public A {
public:
    void calculate( double *i ) {
        *i = sin(*i);
    }
};

class A2 : public A {
public:
    void calculate( double *i ) {
        *i = cos(*i);
    }
};

class A3 : public A {
public:
    void calculate( double *i ) {
        *i = tan(*i);
    }
};

class A4 : public A {
public:
    void calculate( double *i ) {
        *i = sqrt(*i);
    }
};

int main(void)
{
    srand(time(0));
    A *base[100000];
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        array[i] = ( double )( rand() / 1000 );
        switch ( rand() % 4 ) {
        case 0:
            base[i] = new A1();
            break;
        case 1:
            base[i] = new A2();
            break;
        case 2:
            base[i] = new A3();
            break;
        case 3:
            base[i] = new A4();
            break;
        }
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( long int i = 0; i < 100000; ++i ) {
        base[i]->calculate( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}

My system is Linux, Fedora 13, with gcc 4.4.2. The code is compiled with g++ -O3. The first program is test1, the second is test2.

Now I see this in console:

[Ignat@localhost circuit_testing]$ ./test1 && ./test1 
0.0153142
0.0153166

Well, more or less, I think. And then, this:

[Ignat@localhost circuit_testing]$ ./test2 && ./test2 
0.01531
0.0152476

Where is the 25% difference that is supposed to be visible? How can the first executable even be slower than the second one?

I'm asking this because I'm working on a project that involves calling a lot of small functions in a row like this to compute the values of an array, and the code I've inherited performs some very complex manipulations to avoid the virtual function call overhead. So where is this famous call overhead?

+7  A: 

In both cases you are calling functions indirectly. In one case through your table of function pointers, and in the other through the compiler's array of function pointers (the vtable). Not surprisingly, two similar operations give you similar timing results.

anon
Well, from what I know, for a pointer call you only need to load the address of the function to call, and that's it, whereas for the vtable it's at least the address of the vtable and then the pointer. There should be some reason to avoid virtual function calls, after all. That is what the documentation for the project I inherited said, so I just tried to see the difference for myself.
Semen Semenych
@Semen Semenych: But once the vtable indirection has been cached by the CPU, the extra layer of indirection becomes extremely cheap. Performance-wise, the best reason to avoid vtable indirection is not "it's slow", but rather "it prevents some optimizations". It's not slow, it just prevents the code from being fast.
jalf
+2  A: 

Virtual functions may be slower than regular functions, but that's mostly due to lost inlining. If you call a function through a function table, it can't be inlined either, and the lookup time is pretty much the same. Looking up through your own table is of course going to cost about the same as looking up through the compiler's table.
Edit: It may even be slower, because the compiler knows a lot more than you do about things like the processor cache.

DeadMG
+2  A: 

Nowadays, on most systems, memory access is the primary bottleneck, not the CPU. In many cases, there is little significant difference between virtual and non-virtual functions: they usually represent a very small portion of execution time. (Sorry, I don't have reported figures to back this up, just empirical data.)

If you want the best performance, you will get more bang for your buck by looking into how to parallelize the computation to take advantage of multiple cores/processing units, rather than worrying about micro-details of virtual vs non-virtual functions.

mdma
Already doing that, thanks for the advice. So it's not the CPU. I'll have to talk to that guy who wrote the old code.
Semen Semenych
No, that’s just not true in general. Virtual functions can incur a *huge* overhead because of failure to inline, failing branch predictions etc. And of course, even the mere added indirection can be costly in some scenarios (tight loops).
Konrad Rudolph
Yes, this is the answer, gotta stop overcomplicating my design with member function pointers and templates and focus on the real stuff.
Semen Semenych
If we're talking about inline here, then what will inline with guarantee : a function pointer, a member function pointer or is there something else which I don't know?
Semen Semenych
@Konrad - I think "in general" it is true, and "for specific cases" it is not. For a tight inner loop you want to avoid virtual functions, if that really is where the overhead is. But in general, the difference is not worth worrying about - and we are talking generally. If you profile all of an app's virtual functions, you'll find that typically only a handful have any significant overhead.
mdma
If you are worried about function call overhead, then add block-processing functions so that more than one array element can be processed at a time. The block function can be virtual, but the inner function is called non-virtually, e.g. MyClass::calc. Such a design also lends itself well to splitting the work into tasks for handling by a task queue feeding multiple worker threads.
mdma
That could work, if it weren't a graph that can only be traversed sequentially. What I'm doing is modelling a circuit, with continuous functions instead of logical ones. Anyway, thanks for the answer. I think I'll go for member function pointers to get inlining.
Semen Semenych
Ok, I don't know the details, but perhaps you could add the nodes of your graph in sequence to a vector and then process that. Either way, I would rather try to improve performance through a scalable design than incrementally through implementation tricks. Good luck, I hope you get the performance you are after in the end!
mdma
@Konrad Rudolph: True. But when function pointers are stored in tables, they are just as incapable of being inlined as virtual functions.
Martin York
@Martin York: I didn’t argue that point.
Konrad Rudolph
+2  A: 

Many people fall into the habit of doing things just because they are thought to be "faster". It's all relative.

If I'm going to take a 100-mile drive from my home, I have to start by driving around the block. I can drive around the block to the right, or to the left. One of those will be "faster". But will it matter? Of course not.

In this case, the functions that you call are in turn calling math functions.

If you pause the program under the IDE or GDB, I suspect you will find that nearly every time you pause it it will be in those math library routines (or it should be!), and dereferencing an additional pointer to get there (assuming it doesn't bust a cache) should be lost in the noise.

Added: Here is a favorite video: Harry Porter's relay computer. As that thing laboriously clacks away adding numbers and stepping its program counter, I find it helpful to be mindful that that's what all computers are doing, just on a different scale of time and complexity. In your case, think about the algorithm needed to compute sin, cos, tan, or sqrt. Inside, it is clacking away doing those things, and only incidentally following addresses or waiting on a really slow memory to get there.

Mike Dunlavey
A: 

And finally, the function pointer approach has turned out to be the fastest one. Which was what I'd expected from the very beginning.

Semen Semenych
+2  A: 

I think you're seeing the difference, but it's just the function call overhead. Branch misprediction, memory access and the trig functions are the same in both cases. Compared to those, it's just not that big a deal, though the function pointer case was definitely a bit quicker when I tried it.

If this is representative of your larger program, this is a good demonstration that this type of micro-optimization is sometimes just a drop in the ocean, and at worst futile. But leaving that aside, for a clearer test the functions should perform some simpler operation that is different for each function:

void function_not( double *d ) {
    *d = 1.0;
}

void function_and( double *d ) {
    *d = 2.0;
}

And so on, and similarly for the virtual functions.

(Each function should do something different, so that they don't get elided and all end up with the same address; that would make the branch prediction work unrealistically well.)

With these changes, the results are a bit different. Best of 4 runs in each case. (Not very scientific, but the numbers are broadly similar for larger numbers of runs.) All timings are in cycles, running on my laptop. The code was compiled with VC++ (only the timing code changed), but gcc implements virtual function calls in the same way, so the relative timings should be broadly similar even with a different OS/x86 CPU/compiler.

Function pointers: 2,052,770

Virtuals: 3,598,039

That difference seems a bit excessive! Sure enough, the two pieces of code aren't quite the same in their memory access behaviour. The second one should have a table of four A *s, used to fill in base, rather than new'ing up a fresh object for each entry. Both examples will then have similar behaviour (1 cache miss per N entries) when fetching the pointer to jump through. For example:

A *tbl[4] = { new A1, new A2, new A3, new A4 };
for ( long int i = 0; i < 100000; ++i ) {
    array[i] = ( double )( rand() / 1000 );
    base[i] = tbl[ rand() % 4 ];
}

With this in place, still using the simplified functions:

Virtuals (as suggested here): 2,487,699

So there's 20%, best case. Close enough?

So perhaps your colleague was right to at least consider this, but I suspect that in any realistic program the call overhead won't be enough of a bottleneck to be worth jumping through hoops over.

brone
While most of this is clear, I still fail to see any branches in the benchmarked section of the code for either variant. Is it the trigonometric functions? And I didn't know anything about the differences between allocating the A's one way or the other; I will have to look into that. There are two more variants, one with a switch inside the measured code and one with member function pointers. The last one is the fastest, though I'll have to check with the simplified functions. As for the 20%, that's exactly what the article said; I can't find it now, it was a PDF somewhere on the internet.
Semen Semenych
The branch here is the place where the target function is called - "index_array[i]( ", "base[i]->calculate( ". On modern x86 CPUs these benefit from prediction (albeit of a slightly different kind) just like conditional branches. Sorry if my terminology was a bit confusing.
brone
Also, the speed difference between the two ways of allocating the A's is down to the data cache. See http://www.agner.org/optimize/, the "Optimizing C++" PDF; there's a section about it in there.
brone