Suppose I have a function in a single-threaded program that looks like this:

void f(/* some arguments */) {
    char buffer[32];
    // some operations on buffer
}

and f appears inside some loop that gets called often, so I'd like to make it as fast as possible. It looks to me like the buffer needs to get allocated every time f is called, but if I declare it to be static, this wouldn't happen. Is that correct reasoning? Is that a free speedup? And just because it's such an easy speedup, does an optimizing compiler already do something like this for me?
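
For reference, the static version being asked about would look something like this (placeholders kept as comments):

void f(/* some arguments */) {
    static char buffer[32];   // now shared across every call to f
    // some operations on buffer
}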

+7  A: 

Adjusting the stack pointer by 32 bytes will cost virtually nothing on nearly all systems. But you should test it out. Benchmark a static version and a local version and post back.
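
For example, a minimal harness along these lines would do it (a sketch, not from the original answer; it assumes a C++11 compiler for <chrono>, and the buffer write plus the volatile sink stand in for real work so the optimizer can't discard everything):

#include <chrono>
#include <cstdio>

volatile char sink;   // keeps the optimizer from discarding the buffer entirely

void f_local() {
    char buffer[32];
    buffer[0] = 1;
    sink = buffer[0];
}

void f_static() {
    static char buffer[32];
    buffer[0] = 1;
    sink = buffer[0];
}

template <typename F>
double time_it(F fn) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i) fn();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main() {
    std::printf("local:  %.3f s\n", time_it(f_local));
    std::printf("static: %.3f s\n", time_it(f_static));
}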

Inverse
+1 for suggesting a benchmark. Using a profiler should always be the first step in verifying whether something is 'faster' or 'slower'.
linuxuser27
I already know from profiling that it didn't matter in my specific application, but it made me curious if it ever would be. And now I see why it shouldn't.
pythonic metaphor
A: 

With gcc, I do see some speedup:

void f() {
    char buffer[4096];
}

int main() {
    int i;
    for (i = 0; i < 100000000; ++i) {
        f();
    }
}

And the time:

$ time ./a.out

real    0m0.453s
user    0m0.450s
sys  0m0.010s

changing buffer to static:

$ time ./a.out

real    0m0.352s
user    0m0.360s
sys  0m0.000s
Colin
What, no optimizations? That’s not a meaningful benchmark. And of course, with optimizations enabled, the whole call to `f` will most likely be elided.
Konrad Rudolph
I just noticed the question was tagged C++, so I did it again compiling with g++ instead of gcc. Interestingly, the non-static version ran in ~0.5s, while the static version still ran in ~0.35s.
Colin
Oh yeah... optimizations! Using the -O2 flag, both versions run in 0.002s.
Colin
The difference in time is not significant. That amount of time can change depending on task switching, other programs running, and other OS overhead. The time spent optimizing or building a proof of concept costs more (in developer time) than is recouped from the sale of the program.
Thomas Matthews
Ok, give me a little credit. I did run the tests multiple times, to make sure the times were stable.
Colin
If your compiler didn't optimize out the loop, that's not actually optimized.
DeadMG
A realistic function `f` will have more than one variable, which would change your timings I think.
Mark Ransom
+1  A: 

Depending on what exactly the variable is doing and how it's used, the speedup ranges from almost nothing to nothing. On x86 systems, stack memory for all local variables is allocated at once with a single instruction (sub esp, amount), so having even one other stack variable eliminates any gain. The only exception is really huge buffers, in which case the compiler may insert a call to _chkstk to allocate the memory (but if your buffer is that big, you should re-evaluate your code). The compiler cannot turn stack memory into static memory as an optimization, because it cannot assume the function will only be used in a single-threaded environment, and it would also interfere with object constructors and destructors, etc.
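
As a rough sketch of that point (the function and the commented assembly are illustrative, 32-bit x86 assumed, and not taken from the answer above):

void g(int x) {
    char buffer[32];
    int other_local = x;
    buffer[0] = static_cast<char>(other_local);
    // A typical unoptimized prologue reserves space for all locals at once,
    // roughly:
    //     push ebp
    //     mov  ebp, esp
    //     sub  esp, 40      ; buffer and other_local in a single subtraction
    // so the 32-byte buffer never gets a separate "allocation" step of its own.
}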

Necrolis
+3  A: 

The way it is written now, there is no cost for allocation: the 32 bytes are on the stack. The only real work is that you need to zero-initialize it.

A local static is not a good idea here. It won't be faster, and your function can't be used from multiple threads anymore, as all calls share the same buffer. Not to mention that local static initialization is not guaranteed to be thread-safe.

jdv
Normally there is no zero initialization for stack variables.
rstevens
+2  A: 

I would suggest a more general approach to this problem: if you have a function that is called many times and needs some local variables, consider wrapping it in a class and making those variables member variables. Consider what would happen if you needed to make the size dynamic, so that instead of char buffer[32] you had std::vector<char> buffer(requiredSize); that is even more expensive than an array to initialise on every pass through the loop.

class BufferMunger {
public:
    BufferMunger() {}
    void DoFunction(/* args */);
private:
    char buffer[32];
};

BufferMunger m;
for (int i = 0; i < 1000; i++) {
    m.DoFunction(/* arg[i] */);  // only one allocation of buffer
}

There's another implication of making the buffer static, which is that the function is now unsafe in a multithreaded application, as two threads may call it and overwrite the data in the buffer at the same time. On the other hand it's safe to use a separate BufferMunger in each thread that requires it.
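
On the dynamic-size point, a possible sketch (the constructor and the assign call are my illustration, not from the original answer) that pays for the vector's allocation once and then reuses it:

#include <cstddef>
#include <vector>

class BufferMunger {
public:
    explicit BufferMunger(std::size_t requiredSize) : buffer(requiredSize) {}

    void DoFunction(/* args */) {
        // Work in the same storage on every call; assign reuses the
        // existing capacity, so there is no per-call allocation.
        buffer.assign(buffer.size(), 0);
    }

private:
    std::vector<char> buffer;
};

A BufferMunger constructed once outside the loop allocates exactly once, no matter how many times DoFunction runs.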

the_mandrill
+1: Now that you mention it, I remember doing something like this for a class that implemented a Fourier transform.
jdv
+2  A: 

For implementations that use a stack for local variables, allocation often just involves advancing a register (adding a value to it), such as the stack pointer (SP). The cost is negligible, usually one instruction or less.

However, initialization of stack variables takes a little longer, but again, not much. Check your assembly language listing (generated by the compiler or a debugger) for the exact details. There is nothing in the standard about the duration or number of instructions required to initialize variables.
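
With gcc/g++, for example, the -S flag produces such a listing (the file name here is just a placeholder):

$ g++ -O2 -S f.cpp    # writes the assembly listing to f.s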

Allocation of static local variables is usually treated differently. A common approach is to place these variables in the same area as global variables. Usually all the variables in this area are initialized before calling main(). Allocation in this case is a matter of assigning addresses to registers or storing the area information in memory. Not much execution time wasted here.

Dynamic allocation is the case where execution cycles are burned. But that is not in the scope of your question.

Thomas Matthews
+1  A: 

If there are any local automatic variables in the function at all, the stack pointer needs to be adjusted. The time taken for the adjustment is constant, and will not vary based on the number of variables declared. You might save some time if your function is left with no local automatic variables whatsoever.

If a static variable is initialized, there will be a flag somewhere to determine if the variable has already been initialized. Checking the flag will take some time. In your example the variable is not initialized, so this part can be ignored.
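
For illustration (this sketch is not from the original answer, and the environment variable is just an arbitrary stand-in for a value only known at run time), a variant where the flag check would apply:

#include <cstddef>
#include <cstdlib>
#include <cstring>

void f() {
    static char buffer[32];   // no initializer: zero-filled at load time, no flag needed

    // Non-constant initializer: the compiler has to track whether this has
    // already run, so every call starts by checking that hidden flag.
    static std::size_t limit =
        std::getenv("BUF_SIZE_OVERRIDE") ? 16 : sizeof buffer;

    std::memset(buffer, 0, limit);
}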

Static variables should be avoided if your function has any chance of being called recursively or from two different threads.

Mark Ransom
+1  A: 

Note that block-level static variables in C++ (as opposed to C) that require non-trivial initialization are initialized the first time control passes through their declaration. This implies that you'll be introducing the cost of an extra runtime check. That branch could potentially end up making performance worse, not better. (But really, you should profile, as others have mentioned.)

Regardless, I don't think it's worth it, especially since you'd be intentionally sacrificing re-entrancy.

jamesdlin
+1  A: 

It will make the function substantially slower in most real cases. This is because the static data segment is not near the stack, so you lose cache locality: you will get a cache miss when you try to access it. When you allocate a regular char[32] on the stack, however, it is right next to all the other data you need and costs very little to access. The initialization cost of a stack-based array of char is negligible.

This is ignoring that statics have many other problems.

You really need to profile your code and see where the slowdowns are, because no profiler will tell you that allocating a fixed-size buffer of characters is a performance problem.

DeadMG
+1  A: 

If you are writing code for a PC, there is unlikely to be any meaningful speed advantage either way. On some embedded systems, it may be advantageous to avoid all local variables. On some other systems, local variables may be faster.

An example of the former: on the Z80, the code to set up the stack frame for a function with any local variables was pretty long. Further, the code to access local variables was limited to using the (IX+d) addressing mode, which was only available for 8-bit instructions. If X and Y were both global/static or both local variables, the statement "X=Y" could assemble as either:

; If both are static or global: 6 bytes; 32 cycles
  ld HL,(_Y) ; 16 cycles
  ld (_X),HL ; 16 cycles
; If both are local: 12 bytes; 56 cycles
  ld E,(IX+_Y)   ; 14 cycles
  ld D,(IX+_Y+1) ; 14 cycles
  ld (IX+_X),E   ; 14 cycles
  ld (IX+_X+1),D ; 14 cycles

A 100% code space penalty and 75% time penalty in addition to the code and time to set up the stack frame!

On the ARM processor, a single instruction can load a variable which is located within +/-2K of an address pointer. If a function's local variables total 2K or less, they may be accessed with a single instruction. Global variables will generally require two or more instructions to load, depending upon where they are stored.

supercat
+4  A: 

No, it's not a free speedup.

First, the allocation is almost free to begin with (since it consists merely of adding 32 to the stack pointer), and second, there are at least two reasons why a static variable might be slower:

  • You lose cache locality. Data allocated on the stack is going to be in the CPU cache already, so accessing it is extremely cheap. Static data lives in a different area of memory, so it may not be cached; in that case you get a cache miss and have to wait hundreds of clock cycles for the data to be fetched from main memory.
  • You lose thread safety. If two threads execute the function simultaneously, it will crash and burn unless a lock is placed so that only one thread at a time is allowed to execute that section of the code, and that would mean losing the benefit of having multiple CPU cores (see the sketch just after this list).
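
A small sketch of that sharing (the function and names are illustrative, not from the original answer; std::snprintf assumes C++11): even single-threaded, every caller gets the same storage back, which is exactly what goes wrong once two threads are involved.

#include <cstdio>

const char* format_id(int id) {
    static char buffer[32];   // one buffer shared by every call
    std::snprintf(buffer, sizeof buffer, "id-%d", id);
    return buffer;
}

int main() {
    const char* a = format_id(1);
    const char* b = format_id(2);
    std::printf("%s %s\n", a, b);   // prints "id-2 id-2": the first result was clobbered
}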

So it's not a free speedup. But it is possible that it is faster in your case (although I doubt it). So try it out, benchmark it, and see what works best in your particular scenario.

jalf
I would add that you also lose reentrancy: you no longer get a clean variable each time you enter the function (even if for built-ins this is always the case...)
Matthieu M.