views:

384

answers:

4

Hi,

Is the fastcall calling convention really faster than other calling conventions, such as cdecl? Are there any benchmarks out there that show how performance is affected by calling convention?

+3  A: 

Is the fastcall calling convention really faster than other calling conventions, such as cdecl?

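I believe that Microsoft's implementation of fastcall on x86 passes the first two parameters in registers (ECX and EDX) instead of on the stack. On x64 the keyword is ignored, since the standard x64 convention already passes the first four parameters in registers.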

Since it typically saves at least four memory accesses, yes, it is generally faster. However, if the function involved is register-starved and is thus likely to spill those arguments to locals on the stack anyway, there's not likely to be a significant speedup.
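To make the difference concrete, here is a minimal sketch in C. The `CC_FASTCALL` / `CC_CDECL` macros and the `add_*` function names are made up for illustration. The macros expand to the compiler's convention keywords only on 32-bit x86 and to nothing elsewhere, so the snippet should still compile on other targets, where it simply uses the platform default.

```c
/* Hypothetical macros: pick the convention keywords on 32-bit x86 only. */
#if defined(_MSC_VER) && defined(_M_IX86)
#  define CC_FASTCALL __fastcall
#  define CC_CDECL    __cdecl
#elif defined(__GNUC__) && defined(__i386__)
#  define CC_FASTCALL __attribute__((fastcall))
#  define CC_CDECL    __attribute__((cdecl))
#else
#  define CC_FASTCALL /* not 32-bit x86: use the platform default */
#  define CC_CDECL
#endif

/* cdecl: the caller pushes both arguments on the stack and cleans up. */
int CC_CDECL add_cdecl(int a, int b) { return a + b; }

/* fastcall: on 32-bit x86, a and b arrive in ECX and EDX. */
int CC_FASTCALL add_fastcall(int a, int b) { return a + b; }
```

Both functions compute the same result; only the way the arguments reach the callee differs. Comparing the generated assembly for the two (for example with your compiler's assembly listing option) is probably the quickest way to see what your toolchain actually does.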

Anon.
+8  A: 

It depends on the platform. For a Xenon PowerPC, for example, it can be an order of magnitude difference due to a load-hit-store issue with passing data on the stack. I empirically timed the overhead of a cdecl function at about 45 cycles compared to ~4 for a fastcall.

For an out-of-order x86 (Intel and AMD), the impact may be much less, because the registers are all shadowed and renamed anyway.

The answer really is that you need to benchmark it yourself on the particular platform you care about.
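As a rough starting point, a timing harness might look like the sketch below. The names `binop_fn`, `sample_add` and `seconds_per_call` are hypothetical, and `clock()` is a coarse timer, so treat this as an outline rather than a rigorous benchmark. The typedef uses the compiler's default convention; to compare conventions on 32-bit x86 you would add the convention keyword (with MSVC, `__cdecl` or `__fastcall`) to both the typedef and the function under test, and build one harness per convention.

```c
#include <time.h>

/* Hypothetical function-pointer type using the default calling convention. */
typedef int (*binop_fn)(int, int);

/* Trivial function under test, so that call overhead dominates. */
int sample_add(int a, int b) { return a + b; }

/* Returns the average wall-clock-ish cost of one call, in seconds. */
double seconds_per_call(binop_fn fn, long iterations) {
    /* Volatile pointer: forces a real indirect call on every iteration,
       so the compiler cannot inline the function or hoist the call. */
    binop_fn volatile target = fn;
    /* Volatile sink: keeps the results from being optimized away. */
    volatile unsigned sink = 0;

    clock_t start = clock();
    for (long i = 0; i < iterations; ++i)
        sink += (unsigned)target((int)(i & 0xFF), 1);
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

A call such as `seconds_per_call(sample_add, 100000000L)` gives a per-call estimate that still includes loop overhead. Run it several times and compare builds against each other rather than trusting the absolute number.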

Crashworks
+1  A: 

Calling convention (at least on x86) doesn't really make much of a difference in speed. I do note, however, that _stdcall usually results in smaller code size than _cdecl. There is also a reason _fastcall is not the default: what you gain from passing arguments in registers you lose in less efficient function bodies (as previously mentioned by Anon.).

However, we can spout theoretical ideas all day long -- benchmark your code for the right answer. _fastcall will be faster in some cases, and slower in others.

Billy ONeal
+1  A: 

On modern x86 - no. Between L1 cache and inlining there's no place for fastcall.

ima
If a function is inlined it is neither fastcall nor cdecl nor any other calling convention.
Crashworks
Exactly. Fetching from L1 costs about 1 cycle more than reading a register; in most cases that's below noise level, and it's hard to even benchmark reliably. And functions where a few cycles per call make an important difference should be inlined anyway.
ima