views:

13259

answers:

21

This question may sound fairly elemental, but this is a debate I had with another developer I work with.

I was taking care to stack-allocate things where I could, instead of heap-allocating them. He was watching over my shoulder as I worked and commented that it wasn't necessary, because they are the same performance-wise.

I was always under the impression that growing the stack was constant time, and that heap allocation's performance depended on the current complexity of the heap, both for allocation (finding a hole of the proper size) and for deallocation (collapsing holes to reduce fragmentation, which many standard library implementations take time to do during deletes, if I am not mistaken).

This strikes me as something that would probably be very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC and MSVC++, what is the case? Is heap allocation not as high performing as stack allocation? Is there no difference? Or are the differences so minute that it becomes pointless micro-optimization?

+10  A: 

You can write a special heap allocator for specific sizes of objects that performs very well. However, the general-purpose heap allocator does not perform particularly well.

Also I agree with Torbjörn Gyllebring about the expected lifetime of objects. Good point!

Chris Jester-Young
+111  A: 

Stack allocation is much faster, since all it really does is move the stack pointer. Using memory pools, you can get comparable performance out of heap allocation, but that comes with slight added complexity and its own headaches.

Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.

Torbjörn Gyllebring
And more importantly, the stack is always hot; the memory you get is much more likely to be in cache than any far-away heap-allocated memory.
Benoît
On some (mostly embedded, that I know of) architectures, stack may be stored in fast on-die memory (e.g. SRAM). This can make a huge difference!
leander
+3  A: 

I don't think stack allocation and heap allocation are generally interchangeable. I also hope that the performance of both of them is sufficient for general use.

For small items, I'd strongly recommend whichever one is more suitable to the scope of the allocation. For large items, the heap is probably necessary.

On 32-bit operating systems with multiple threads, the stack is often rather limited (albeit typically to at least a few MB), because the address space needs to be carved up and sooner or later one thread's stack will run into another's. On single-threaded systems (Linux glibc single-threaded, anyway) the limitation is much smaller, because the stack can just grow and grow.

On 64-bit operating systems there is enough address space to make thread stacks quite large.

MarkR
+43  A: 

Stack is much faster. It literally only uses a single instruction on most architectures, in most cases, e.g. on x86:

sub esp, 0x10

(That moves the stack pointer down by 0x10 bytes and thereby "allocates" those bytes for use by a variable.)

Of course, the stack's size is very, very finite, as you will quickly find out if you overuse stack allocation or try to do infinite recursion :-)

Also, there's little reason to optimize the performance of code that doesn't verifiably need it, as demonstrated by profiling. "Premature optimization" often causes more problems than it's worth.

My rule of thumb: if I know I'm going to need some data at compile-time, and it's under a few hundred bytes in size, I stack-allocate it. Otherwise I heap-allocate it.
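In C++ terms, that rule of thumb might look something like this sketch (the function names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Size known at compile time and small: keep it on the stack.
std::size_t handles_fixed_input() {
    char buffer[256];              // stack storage, freed automatically on return
    buffer[0] = '\0';
    return sizeof(buffer);
}

// Size known only at run time (or potentially large): use the heap.
std::size_t handles_variable_input(std::size_t n) {
    std::vector<char> buffer(n);   // storage lives on the heap
    if (n != 0) buffer[0] = '\0';
    return buffer.size();
}
```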

Dan
One instruction, and that is usually shared by ALL objects on the stack.
MSalters
Made the point well, especially the point about verifiably needing it. I'm continually amazed at how people's concerns about performance are misplaced.
Mike Dunlavey
"Deallocation" is also very simple and is done with a single `leave` instruction.
doc
Keep in mind the "hidden" cost here, especially the first time you extend the stack. Doing so might result in a page fault and a context switch to the kernel, which needs to do some work to allocate the memory (or load it from swap, in the worst case).
nos
+4  A: 

Usually stack allocation just consists of subtracting from the stack pointer register. This is tons faster than searching a heap.

Sometimes stack allocation requires adding a page(s) of virtual memory. Adding a new page of zeroed memory doesn't require reading a page from disk, so usually this is still going to be tons faster than searching a heap (especially if part of the heap was paged out too). In a rare situation, and you could construct such an example, enough space just happens to be available in part of the heap which is already in RAM, but allocating a new page for the stack has to wait for some other page to get written out to disk. In that rare situation, the heap is faster.

Windows programmer
I don't think the heap is "searched" unless it's paged. Pretty sure solid state memory uses a multiplexor and can gain direct access to the memory, hence the Random Access Memory.
Joe Philllips
Here's an example. The calling program asks to allocate 37 bytes. The library function looks for a block of at least 40 bytes. The first block on the free list has 16 bytes. The second block on the free list has 12 bytes. The third block has 44 bytes. The library stops searching at that point.
Windows programmer
+4  A: 

Also note that most heap allocation implementations are made thread-safe (which comes at a cost), whereas stack allocations are by default thread-safe.

eli
+1 If your application is heavily multithreaded and is doing a lot of memory allocation, contention for the malloc lock can become the dominating limit to scalability. That's why Intel's Threading Building Blocks, for example, come with their own version of malloc optimized for scalability.
Martin B
+1  A: 

I think the lifetime is crucial, and whether the thing being allocated has to be constructed in a complex way. For example, in transaction-driven modeling, you usually have to fill in and pass in a transaction structure with a bunch of fields to operation functions. Look at the OSCI SystemC TLM-2.0 standard for an example.

Allocating these on the stack close to the call to the operation tends to cause enormous overhead, as the construction is expensive. The good way there is to allocate on the heap and reuse the transaction objects either by pooling or a simple policy like "this module only needs one transaction object ever".

This is many times faster than allocating the object on each operation call.

The reason is simply that the object has an expensive construction and a fairly long useful lifetime.

I would say: try both and see what works best in your case, because it can really depend on the behavior of your code.

jakobengblom2
+2  A: 

Probably the biggest problem of heap allocation versus stack allocation, is that heap allocation in the general case is an unbounded operation, and thus you can't use it where timing is an issue.

For other applications where timing isn't an issue, it may not matter as much, but if you heap allocate a lot, this will affect the execution speed. Always try to use the stack for short lived and often allocated memory (for instance in loops), and as long as possible - do heap allocation during application startup.

larsivi
A: 

A stack has a limited capacity, while a heap does not. The typical stack for a process or thread is around 8K. You cannot change the size once it's allocated.

A stack variable follows the scoping rules, while a heap one doesn't. If your instruction pointer goes beyond a function, all the new variables associated with the function go away.

Most important of all, you can't predict the overall function call chain in advance. So a mere 200 bytes allocation on your part may raise a stack overflow. This is especially important if you're writing a library, not an application.

yogman
The amount of virtual address space allocated for a user mode stack on a modern OS is likely to be at least 64kB or larger by default (1MB on Windows). Are you talking about kernel stack sizes?
bk1e
On my machine, the default stack size for a process is 8MB, not kB. How old is your computer?
Greg Rogers
It was a cell phone.
yogman
+16  A: 

Honestly, it's trivial to write a program to compare the performance:

#include <ctime>
#include <iostream>

namespace {
    class empty { }; // even empty classes take up 1 byte of space, minimum
}

int main()
{
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000; ++i)
        empty e;
    std::clock_t duration = std::clock() - start;
    std::cout << "stack allocation took " << duration << " clock ticks\n";
    start = std::clock();
    for (int i = 0; i < 100000; ++i) {
        empty* e = new empty;
        delete e;
    }
    duration = std::clock() - start;
    std::cout << "heap allocation took " << duration << " clock ticks\n";
}

It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.

AN OPTIMIZING COMPILER MAY NOTICE THAT THIS CODE DOES NOTHING, AND MAY OPTIMIZE IT ALL AWAY. IT IS THE OPTIMIZER'S JOB TO DO STUFF LIKE THAT, AND FIGHTING THE OPTIMIZER IS A FOOL'S ERRAND. I WOULD RECOMMEND COMPILING THIS CODE WITH OPTIMIZATION TURNED OFF BECAUSE THERE IS NO GOOD WAY TO FOOL EVERY OPTIMIZER CURRENTLY IN USE OR THAT WILL BE IN USE IN THE FUTURE. ANYBODY WHO TURNS THE OPTIMIZER ON AND THEN COMPLAINS ABOUT FIGHTING IT SHOULD BE SUBJECT TO PUBLIC RIDICULE. If I cared about nanosecond precision I wouldn't use std::clock(). If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.

The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:

  1. Add a data member to class empty, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."

  2. Declare e volatile, but volatile is often compiled incorrectly (PDF).

  3. Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.

Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement, but then keeping a reference will again skew the numbers.

On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.


Two comments about the code:

  1. Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000 stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.

    To be on the safe side, it would be possible to add a field to empty (but then the name "empty" would be incorrect) and access that field, but that would add variable costs to the loop (accessing a field directly is faster than accessing a field through a pointer). Taking the address of e would also work, and "should" take the same amount of time in both loops.

  2. I don't see a need to "calibrate the loops" because I don't care about the actual time spent, only about which loop takes longer. If initializing i, testing i, incrementing i and jumping to the beginning of the loop takes a fixed amount of time in loop A, it will take the same amount of time in loop B. I don't care what that amount of time is, because it doesn't affect the end result (fixed time spent in loop manipulation plus time spent in stack allocation is X; the same fixed time spent in loop manipulation plus time spent in heap allocation is Y; X < Y). Heap allocation is still much worse.


I'll concede I misunderstood the comment about calibrating the loops. I'm leaving that portion so that anybody concerned about the things I had mentioned will have answers.

Max Lybbert
Your example is flawed: any decent compiler is going to take out the "empty e" line from the 3rd line (main sub) in the common sub-expression elimination and dead-code elimination phases. Better do that within a function and call it from the loop. Probably you should do the same for stack.
Joe Pineda
Also, you should add a "calibration" loop at the very beginning of your main function, something to give you an idea how much time per loop-cycle you're getting, and adjust the other loops so as to ensure your example runs for some amount of time, instead of the fixed constant you're using.
Joe Pineda
Calibration loop: if it runs for too short, your results won't be statistically significant. When you moved it to 10 million, you got significant results. However, on my old 386 the original 100K would have been more than adequate for the task, hence the need for the calibration loop.
Joe Pineda
I'm also glad increasing the number of times each option loop runs (plus instructing g++ not to optimize?) has yielded significant results. So now we have hard facts to say stack is faster. Thanks for your efforts!
Joe Pineda
What you are claiming is *wrong*. The compiler won't emit any code to grow the stack for your empty class. I compiled your source without optimization under VS 2005 and disassembled the exe with IDA - no stack allocation at all.
jn_
What I am claiming is (1) a test program is trivial, and (2) GCC has these performance characteristics. The C++ standard requires that objects take up at least one byte of space; if Microsoft isn't following the standard I'm sure you can figure out how to make the object take up space.
Max Lybbert
The answer also said "To be on the safe side, it would be possible to add a field to empty ... and access that field. ... Taking the address of e would also work, and 'should' take the same amount of time in both loops." I find it odd you disassembled an exe but didn't read the whole answer.
Max Lybbert
Well it's the same for GCC at least with optimization turned on - no stack allocation even with additional fields
jn_
That's why I turned optimization off. I'm not trying to find the best timing results ever. I'm not going to publish the results as part of a doctoral thesis. I simply wanted to demonstrate that malloc is hundreds of times slower than stack allocation.
Max Lybbert
It's the optimizer's job to get rid of code like this. Is there a good reason to turn the optimizer on and then prevent it from actually optimizing? I've edited the answer to make things even clearer: if you enjoy fighting the optimizer, be prepared to learn how smart compiler writers are.
Max Lybbert
You are only allocating the object on the stack once. To allocate 10000 objects, you need to do empty array[10000];
erikkallen
+2  A: 

It's not just stack allocation that's faster. You also win a lot by using stack variables: they have better locality of reference, and deallocation is a lot cheaper too.

MSalters
A: 

There's a general point to be made about such optimizations.

The optimization you get is proportional to the amount of time the program counter is actually in that code.

If you sample the program counter, you will find out where it spends its time, and that is usually in a tiny part of the code, and often in library routines you have no control over.

Only if you find it spending much time in the heap-allocation of your objects will it be noticeably faster to stack-allocate them.

Mike Dunlavey
A: 

Never make premature assumptions; other application code and usage can impact your function, so looking at a function in isolation is of no use.

If you are serious about the application, then profile it with VTune or a similar profiling tool and look at the hotspots.

Ketan
A: 

It has been mentioned before that stack allocation is simply moving the stack pointer, that is, a single instruction on most architectures. Compare that to what generally happens in the case of heap allocation.

The operating system maintains portions of free memory as a linked list, with the payload data consisting of the pointer to the starting address of the free portion and the size of the free portion. To allocate X bytes of memory, the linked list is traversed and each node is visited in sequence, checking whether its size is at least X. When a portion with size P >= X is found, P is split into two parts with sizes X and P-X. The linked list is updated and the pointer to the first part is returned.

As you can see, heap allocation depends on many factors, like how much memory you are requesting, how fragmented the memory is, and so on.

Nikhil
A: 

In general, stack allocation is faster than heap allocation as mentioned by almost every answer above. A stack push or pop is O(1), whereas allocating or freeing from a heap could require a walk of previous allocations. However you shouldn't usually be allocating in tight, performance-intensive loops, so the choice will usually come down to other factors.

It might be good to make this distinction: you can use a "stack allocator" on the heap. Strictly speaking, I take stack allocation to mean the actual method of allocation rather than the location of the allocation. If you're allocating a lot of stuff on the actual program stack, that could be bad for a variety of reasons. On the other hand, using a stack method to allocate on the heap when possible is the best choice you can make for an allocation method.

Since you mentioned Metrowerks and PPC, I'm guessing you mean Wii. In this case memory is at a premium, and using a stack allocation method wherever possible guarantees that you don't waste memory on fragments. Of course, doing this requires a lot more care than "normal" heap allocation methods. It's wise to evaluate the tradeoffs for each situation.

Dan Olson
+8  A: 

An interesting thing I learned about Stack vs. Heap Allocation on the Xbox 360 Xenon processor, which may also apply to other multicore systems, is that allocating on the Heap causes a Critical Section to be entered to halt all other cores so that the alloc doesn't conflict. Thus, in a tight loop, Stack Allocation was the way to go for fixed sized arrays as it prevented stalls.

This may be another speedup to consider if you're coding for multicore/multiproc, in that your stack allocation will only be viewable by the core running your scoped function, and that will not affect any other cores/CPUs.

Furious Coder
That's true of most multicore machines, not just the Xenon. Even Cell has to do it because you might be running two hardware threads on that PPU core.
Crashworks
That's an effect of the (particularly poor) implementation of the heap allocator. Better heap allocators need not acquire a lock on every allocation.
Chris Dodd
+2  A: 

Aside from the orders-of-magnitude performance advantage over heap allocation, stack allocation is preferable for long running server applications. Even the best managed heaps eventually get so fragmented that application performance degrades.

Jay
A: 

Why not simply replace `empty e;` with something like `int j = i;`? That would make sure that stack allocation did take place.

sactiw
A: 

Stack allocation will almost always be as fast or faster than heap allocation, although it is certainly possible for a heap allocator to simply use a stack based allocation technique.

However, there are larger issues when dealing with the overall performance of stack vs. heap based allocation (or, in slightly better terms, local vs. external allocation). Usually, heap (external) allocation is slow because it deals with many different kinds of allocations and allocation patterns. Reducing the scope of the allocator you are using (making it local to the algorithm/code) will tend to increase performance without any major changes.

Adding better structure to your allocation patterns, for example forcing a LIFO ordering on allocation and deallocation pairs, can also improve your allocator's performance by using the allocator in a simpler and more structured way. Or you can use or write an allocator tuned for your particular allocation pattern; most programs allocate a few discrete sizes frequently, so a heap based on a lookaside buffer of a few fixed (preferably known) sizes will perform extremely well. Windows uses its Low Fragmentation Heap for this very reason.

On the other hand, stack-based allocation on a 32-bit memory range is also fraught with peril if you have too many threads. Stacks need a contiguous memory range, so the more threads you have, the more virtual address space you will need for them to run without a stack overflow. This won't be a problem (for now) with 64-bit, but it can certainly wreak havoc in long running programs with lots of threads. Running out of virtual address space due to fragmentation is always a pain to deal with.

MSN
A: 

Stack allocation is a couple of instructions, whereas the fastest RTOS heap allocator known to me (TLSF) uses on the order of 150 instructions on average. Stack allocations also don't require a lock, because each thread has its own stack, which is another huge performance win. So stack allocations can be 2-3 orders of magnitude faster, depending on how heavily multithreaded your environment is.

In general, heap allocation is your last resort if you care about performance. A viable in-between option is a fixed-size pool allocator, which is also only a couple of instructions and has very little per-allocation overhead, so it's great for small, fixed-size objects. On the downside, it only works with fixed-size objects, is not inherently thread-safe, and has block fragmentation problems.

Andrei Pokrovsky
A: 

Stack allocation is much faster.

Atul