views:

238

answers:

6

Hello everybody, I'm trying to benchmark the difference between a function pointer call and a virtual function call. To do this, I have written two pieces of code that do the same mathematical computation over an array. One variant uses an array of pointers to functions and calls those in a loop. The other variant uses an array of pointers to a base class and calls its virtual function, which is overridden in the derived classes to do exactly the same thing as the functions in the first variant. Then I print the elapsed time and use a simple shell script to run the benchmark many times and compute the average run time.

Here is the code:

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

using namespace std;

long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    // The LL suffix forces 64-bit arithmetic; on systems with a 32-bit
    // time_t, tv_sec * 1000000000 would otherwise overflow.
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
        ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}

void function_not( double *d ) {
    *d = sin(*d);
}

void function_and( double *d ) {
    *d = cos(*d);
}

void function_or( double *d ) {
    *d = tan(*d);
}

void function_xor( double *d ) {
    *d = sqrt(*d);
}

void ( * const function_table[4] )( double* ) = { &function_not, &function_and, &function_or, &function_xor };

int main(void)
{
    srand(time(0));
    void ( * index_array[100000] )( double * );
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i] = function_table[ rand() % 4 ];
        array[i] = ( double )( rand() / 1000 );
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( long int i = 0; i < 100000; ++i ) {
        index_array[i]( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}

and here is the virtual function variant:

#include <iostream>
#include <cstdlib>
#include <ctime>
#include <cmath>

using namespace std;

long long timespecDiff(struct timespec *timeA_p, struct timespec *timeB_p)
{
    // As above: the LL suffix avoids overflow with a 32-bit time_t.
    return ((timeA_p->tv_sec * 1000000000LL) + timeA_p->tv_nsec) -
        ((timeB_p->tv_sec * 1000000000LL) + timeB_p->tv_nsec);
}

class A {
public:
    virtual ~A() {}
    virtual void calculate( double *i ) = 0;
};

class A1 : public A {
public:
    void calculate( double *i ) {
        *i = sin(*i);
    }
};

class A2 : public A {
public:
    void calculate( double *i ) {
        *i = cos(*i);
    }
};

class A3 : public A {
public:
    void calculate( double *i ) {
        *i = tan(*i);
    }
};

class A4 : public A {
public:
    void calculate( double *i ) {
        *i = sqrt(*i);
    }
};

int main(void)
{
    srand(time(0));
    A *base[100000];
    double array[100000];
    for ( long int i = 0; i < 100000; ++i ) {
        array[i] = ( double )( rand() / 1000 );
        switch ( rand() % 4 ) {
        case 0:
            base[i] = new A1();
            break;
        case 1:
            base[i] = new A2();
            break;
        case 2:
            base[i] = new A3();
            break;
        case 3:
            base[i] = new A4();
            break;
        }
    }

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for ( long int i = 0; i < 100000; ++i ) {
        base[i]->calculate( &array[i] );
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);

    unsigned long long time_elapsed = timespecDiff(&end, &start);
    cout << time_elapsed / 1000000000.0 << endl;
}

My system is Linux, Fedora 13, with gcc 4.4.2. The code is compiled with g++ -O3. The first program is test1, the second is test2.

Now I see this in console:

[Ignat@localhost circuit_testing]$ ./test1 && ./test1 
0.0153142
0.0153166

Well, more or less, I think. And then, this:

[Ignat@localhost circuit_testing]$ ./test2 && ./test2 
0.01531
0.0152476

Where is the 25% difference that is supposed to be visible? How can the first executable even be slower than the second one?

I'm asking this because I'm working on a project that involves calling a lot of small functions in a row like this to compute the values of an array, and the code I've inherited performs some very complex manipulations to avoid the virtual function call overhead. So where is this famous call overhead?

+7  A: 

In both cases you are calling functions indirectly. In one case through your table of function pointers, and in the other through the compiler's array of function pointers (the vtable). Not surprisingly, two similar operations give you similar timing results.

anon
Well, from what I know, for a pointer call you only need to load the address of the function to call, and that's it, whereas for the vtable it's at least the address of the vtable and then the pointer. There should be some reason to avoid virtual function calls, after all. That is what the documentation for the project I inherited said, so I just tried to see the difference for myself.
Semen Semenych
@Semen Semenych: But once the vtable indirection has been cached by the CPU, the extra layer of indirection becomes extremely cheap. Performance-wise, the best reason to avoid vtable indirection is not "it's slow", but rather "it prevents some optimizations". It's not slow, it just prevents the code from being fast.
jalf
+2  A: 

Virtual functions may be slower than regular functions, but that's mostly due to lost inlining. If you call a function through a function table, it can't be inlined either, and the lookup time is pretty much the same. Looking up through your own table is of course going to cost about the same as looking up through the compiler's table.
Edit: It may even be slower, because the compiler knows a lot more than you do about things like the processor cache.

DeadMG
+2  A: 

Nowadays, on most systems, memory access is the primary bottleneck, not the CPU. In many cases, there is little significant difference between virtual and non-virtual functions: they usually represent a very small portion of execution time. (Sorry, I don't have reported figures to back this up, just empirical data.)

If you want the best performance, you will get more bang for your buck by looking into how to parallelize the computation to take advantage of multiple cores/processing units, rather than worrying about micro-details of virtual vs non-virtual functions.

mdma
Already doing that, thanks for the advice. So it's not the CPU. I'll have to talk to that guy who wrote the old code.
Semen Semenych
No, that’s just not true in general. Virtual functions can incur a *huge* overhead because of failure to inline, failing branch predictions etc. And of course, even the mere added indirection can be costly in some scenarios (tight loops).
Konrad Rudolph
Yes, this is the answer, gotta stop overcomplicating my design with member function pointers and templates and focus on the real stuff.
Semen Semenych
If we're talking about inline here, then what will inline with guarantee : a function pointer, a member function pointer or is there something else which I don't know?
Semen Semenych
@Konrad - I think "in general" it is true, and "for specific cases" it is not. For a tight inner loop you want to avoid virtual functions, if that really is where the overhead is. But in general, the difference is not worth worrying about - and we are talking generally. If you profile all of an app's virtual functions, you'll find that typically only a handful have any significant overhead.
mdma
If you are worried about function call overhead, then add block-processing functions so that more than one array element can be processed at a time. The block function can be virtual, but the inner function is called non-virtually, e.g. MyClass::calc. Such a design also lends itself well to splitting the work into tasks for handling by a task queue feeding multiple worker threads.
mdma
That could work, if it weren't a graph that can only be traversed sequentially. What I'm doing is modelling a circuit, with continuous functions instead of logical ones. Anyway, thanks for the answer. I think I'll go for member function pointers to get inlining.
Semen Semenych
Ok, I don't know the details, but perhaps you could add the nodes of your graph in sequence to a vector and then process that. Either way, I would rather try to improve performance through a scalable design than incrementally through implementation tricks. Good luck, I hope you get the performance you are after in the end!
mdma
@Konrad Rudolph: True. But when function pointers are stored in tables, they are just as incapable of being inlined as virtual functions.
Martin York
@Martin York: I didn’t argue that point.
Konrad Rudolph
+2  A: 

Many people fall into the habit of doing things just because they are thought to be "faster". It's all relative.

If I'm going to take a 100-mile drive from my home, I have to start by driving around the block. I can drive around the block to the right, or to the left. One of those will be "faster". But will it matter? Of course not.

In this case, the functions that you call are in turn calling math functions.

If you pause the program under the IDE or GDB, I suspect you will find that nearly every time you pause it it will be in those math library routines (or it should be!), and dereferencing an additional pointer to get there (assuming it doesn't bust a cache) should be lost in the noise.

Added: Here is a favorite video: Harry Porter's relay computer. As that thing laboriously clacks away adding numbers and stepping its program counter, I find it helpful to be mindful that that's what all computers are doing, just on a different scale of time and complexity. In your case, think about the algorithm needed to compute sin, cos, tan, or sqrt. Inside, it is clacking away doing those things, and only incidentally following addresses or waiting on a really slow memory to get there.

Mike Dunlavey
A: 

And finally, the function pointer approach has turned out to be the fastest one. Which was what I'd expected from the very beginning.

Semen Semenych
+2  A: 

I think you're seeing the difference, but it's just the function call overhead. Branch misprediction, memory access and the trig functions are the same in both cases. Compared to those, it's just not that big a deal, though the function pointer case was definitely a bit quicker when I tried it.

If this is representative of your larger program, this is a good demonstration that this type of micro-optimization is sometimes just a drop in the ocean, and at worst futile. But leaving that aside, for a clearer test the functions should perform some simpler operation that is different for each function:

void function_not( double *d ) {
    *d = 1.0;
}

void function_and( double *d ) {
    *d = 2.0;
}

And so on, and similarly for the virtual functions.

(Each function should do something different, so that they don't get elided and all end up with the same address; that would make the branch prediction work unrealistically well.)

With these changes, the results are a bit different. Best of 4 runs in each case. (Not very scientific, but the numbers are broadly similar for larger numbers of runs.) All timings are in cycles, running on my laptop. The code was compiled with VC++ (only the timing code changed), but gcc implements virtual function calls in the same way, so the relative timings should be broadly similar even with a different OS/x86 CPU/compiler.

Function pointers: 2,052,770

Virtuals: 3,598,039

That difference seems a bit excessive! Sure enough, the two pieces of code aren't quite the same in their memory access behaviour. The second one should have a table of four A *s, used to fill in base, rather than new'ing up a fresh object for each entry. Both examples will then have similar behaviour (1 cache miss per N entries) when fetching the pointer to jump through. For example:

A *tbl[4] = { new A1, new A2, new A3, new A4 };
for ( long int i = 0; i < 100000; ++i ) {
    array[i] = ( double )( rand() / 1000 );
    base[i] = tbl[ rand() % 4 ];
}

With this in place, still using the simplified functions:

Virtuals (as suggested here): 2,487,699

So there's 20%, best case. Close enough?

So perhaps your colleague was right to at least consider this, but I suspect that in any realistic program the call overhead won't be enough of a bottleneck to be worth jumping through hoops over.

brone
While most of this is clear, I still fail to see any branches in the benchmarked section of the code for either variant. Is it the trigonometric functions? And I didn't know anything about the differences between allocating the A's one way or the other; I will have to look into that. There are two more variants, one with a switch inside the measured code and one with member function pointers. The last one is the fastest, though I'll have to check with the simplified functions. As for the 20%, that's exactly what the article said; I can't find it now, it was a PDF somewhere on the internet.
Semen Semenych
The branch here is the place where the target function is called - "index_array[i]( ", "base[i]->calculate( ". On modern x86 CPUs these benefit from prediction (albeit of a slightly different kind) just like conditional branches. Sorry if my terminology was a bit confusing.
brone
Also, the speed difference between the two ways of allocating the A's is down to the data cache. See http://www.agner.org/optimize/, the "Optimizing C++" PDF; there's a section about it in there.
brone