views:

13259

answers:

21

This question may sound fairly elemental, but this is a debate I had with another developer I work with.

I was taking care to stack-allocate things where I could, instead of heap-allocating them. He was watching over my shoulder as I worked and commented that it wasn't necessary, because they are the same performance-wise.

I was always under the impression that growing the stack was constant time, and that heap allocation's performance depended on the current complexity of the heap, both for allocation (finding a hole of the proper size) and for deallocation (collapsing holes to reduce fragmentation, which many standard library implementations take time to do during deletes, if I am not mistaken).

This strikes me as something that would probably be very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC and MSVC++, what is the case? Is heap allocation not as high performing as stack allocation? Is there no difference? Or are the differences so minute that it becomes pointless micro-optimization?

+10  A: 

You can write a special heap allocator for specific sizes of objects that performs very well. However, the general-purpose heap allocator does not perform particularly well.

Also I agree with Torbjörn Gyllebring about the expected lifetime of objects. Good point!

Chris Jester-Young
+111  A: 

Stack allocation is much faster, since all it really does is move the stack pointer. Using memory pools, you can get comparable performance out of heap allocation, but that comes with slight added complexity and its own headaches.

Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.

Torbjörn Gyllebring
And more importantly, the stack is always hot; the memory you get is much more likely to be in cache than any far-away heap-allocated memory.
Benoît
On some (mostly embedded, that I know of) architectures, stack may be stored in fast on-die memory (e.g. SRAM). This can make a huge difference!
leander
+3  A: 

I don't think stack allocation and heap allocation are generally interchangeable. I also hope that the performance of both of them is sufficient for general use.

For small items, I'd strongly recommend whichever one is more suitable to the scope of the allocation. For large items, the heap is probably necessary.

On 32-bit operating systems with multiple threads, the stack is often rather limited (albeit typically to at least a few MB), because the address space needs to be carved up and sooner or later one thread's stack will run into another's. On single-threaded systems (Linux glibc single-threaded, anyway) the limitation is much smaller, because the stack can just grow and grow.

On 64-bit operating systems there is enough address space to make thread stacks quite large.

MarkR
+43  A: 

Stack is much faster. It literally only uses a single instruction on most architectures, in most cases, e.g. on x86:

sub esp, 0x10

(That moves the stack pointer down by 0x10 bytes and thereby "allocates" those bytes for use by a variable.)

Of course, the stack's size is very, very finite, as you will quickly find out if you overuse stack allocation or try to do infinite recursion :-)

Also, there's little reason to optimize the performance of code that doesn't verifiably need it, as demonstrated by profiling. "Premature optimization" often causes more problems than it's worth.

My rule of thumb: if I know I'm going to need some data at compile-time, and it's under a few hundred bytes in size, I stack-allocate it. Otherwise I heap-allocate it.
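In C++ terms, that rule of thumb might look something like this sketch (the function names are illustrative):

```cpp
#include <cstddef>
#include <vector>

// Size known at compile time and small: keep it on the stack.
std::size_t handles_fixed_input() {
    char buffer[256];              // stack storage, freed automatically on return
    buffer[0] = '\0';
    return sizeof(buffer);
}

// Size known only at run time (or potentially large): use the heap.
std::size_t handles_variable_input(std::size_t n) {
    std::vector<char> buffer(n);   // storage lives on the heap
    if (n != 0) buffer[0] = '\0';
    return buffer.size();
}
```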

Dan
One instruction, and that is usually shared by ALL objects on the stack.
MSalters
Made the point well, especially the point about verifiably needing it. I'm continually amazed at how people's concerns about performance are misplaced.
Mike Dunlavey
"Deallocation" is also very simple and is done with a single `leave` instruction.
doc
Keep in mind the "hidden" cost here, especially the first time you extend the stack. Doing so might result in a page fault and a context switch to the kernel, which needs to do some work to allocate the memory (or load it from swap, in the worst case).
nos
+4  A: 

Usually stack allocation just consists of subtracting from the stack pointer register. This is tons faster than searching a heap.

Sometimes stack allocation requires adding a page(s) of virtual memory. Adding a new page of zeroed memory doesn't require reading a page from disk, so usually this is still going to be tons faster than searching a heap (especially if part of the heap was paged out too). In a rare situation, and you could construct such an example, enough space just happens to be available in part of the heap which is already in RAM, but allocating a new page for the stack has to wait for some other page to get written out to disk. In that rare situation, the heap is faster.

Windows programmer
I don't think the heap is "searched" unless it's paged. Pretty sure solid state memory uses a multiplexor and can gain direct access to the memory, hence the Random Access Memory.
Joe Philllips
Here's an example. The calling program asks to allocate 37 bytes. The library function looks for a block of at least 40 bytes. The first block on the free list has 16 bytes. The second block on the free list has 12 bytes. The third block has 44 bytes. The library stops searching at that point.
Windows programmer
+4  A: 

Also note that most heap allocation implementations are made thread-safe (which comes at a cost), whereas stack allocations are by default thread-safe.

eli
+1 If your application is heavily multithreaded and is doing a lot of memory allocation, contention for the malloc lock can become the dominating limit to scalability. That's why Intel's Threading Building Blocks, for example, come with their own version of malloc optimized for scalability.
Martin B
+1  A: 

I think the lifetime is crucial, and whether the thing being allocated has to be constructed in a complex way. For example, in transaction-driven modeling, you usually have to fill in and pass in a transaction structure with a bunch of fields to operation functions. Look at the OSCI SystemC TLM-2.0 standard for an example.

Allocating these on the stack close to the call to the operation tends to cause enormous overhead, as the construction is expensive. The good way there is to allocate on the heap and reuse the transaction objects either by pooling or a simple policy like "this module only needs one transaction object ever".

This is many times faster than allocating the object on each operation call.

The reason is simply that the object has an expensive construction and a fairly long useful lifetime.

I would say: try both and see what works best in your case, because it can really depend on the behavior of your code.

jakobengblom2
+2  A: 

Probably the biggest problem of heap allocation versus stack allocation, is that heap allocation in the general case is an unbounded operation, and thus you can't use it where timing is an issue.

For other applications where timing isn't an issue, it may not matter as much, but if you heap allocate a lot, this will affect the execution speed. Always try to use the stack for short lived and often allocated memory (for instance in loops), and as long as possible - do heap allocation during application startup.

larsivi
A: 

A stack has a limited capacity, while a heap does not. The typical stack for a process or thread is around 8K. You cannot change the size once it's allocated.

A stack variable follows the scoping rules, while a heap one doesn't. If your instruction pointer goes beyond a function, all the new variables associated with the function go away.

Most important of all, you can't predict the overall function call chain in advance. So a mere 200 bytes allocation on your part may raise a stack overflow. This is especially important if you're writing a library, not an application.

yogman
The amount of virtual address space allocated for a user mode stack on a modern OS is likely to be at least 64kB or larger by default (1MB on Windows). Are you talking about kernel stack sizes?
bk1e
On my machine, the default stack size for a process is 8MB, not kB. How old is your computer?
Greg Rogers
It was a cell phone.
yogman
+16  A: 

Honestly, it's trivial to write a program to compare the performance:

#include <ctime>
#include <iostream>

namespace {
    class empty { }; // even empty classes take up 1 byte of space, minimum
}

int main()
{
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000; ++i)
        empty e;
    std::clock_t duration = std::clock() - start;
    std::cout << "stack allocation took " << duration << " clock ticks\n";
    start = std::clock();
    for (int i = 0; i < 100000; ++i) {
        empty* e = new empty;
        delete e;
    }
    duration = std::clock() - start;
    std::cout << "heap allocation took " << duration << " clock ticks\n";
}

It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.

AN OPTIMIZING COMPILER MAY NOTICE THAT THIS CODE DOES NOTHING, AND MAY OPTIMIZE IT ALL AWAY. IT IS THE OPTIMIZER'S JOB TO DO STUFF LIKE THAT, AND FIGHTING THE OPTIMIZER IS A FOOL'S ERRAND. I WOULD RECOMMEND COMPILING THIS CODE WITH OPTIMIZATION TURNED OFF BECAUSE THERE IS NO GOOD WAY TO FOOL EVERY OPTIMIZER CURRENTLY IN USE OR THAT WILL BE IN USE IN THE FUTURE. ANYBODY WHO TURNS THE OPTIMIZER ON AND THEN COMPLAINS ABOUT FIGHTING IT SHOULD BE SUBJECT TO PUBLIC RIDICULE. If I cared about nanosecond precision I wouldn't use std::clock(). If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.

The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:

  1. Add a data member to class empty, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."

  2. Declare e volatile, but volatile is often compiled incorrectly (PDF).

  3. Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.

Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement, but then keeping a reference will again skew the numbers.

On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.


Two comments about the code:

  1. Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000 stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.

    To be on the safe side, it would be possible to add a field to empty (but then the name "empty" would be incorrect) and access that field, but that would add variable costs to the loop (accessing a field directly is faster than accessing a field through a pointer). Taking the address of e would also work, and "should" take the same amount of time in both loops.

  2. I don't see a need to "calibrate the loops" because I don't care about the actual time spent, only about which loop takes longer. If initializing i, testing i, incrementing i and jumping to the beginning of the loop takes a fixed amount of time in loop A, it will take the same amount of time in loop B. I don't care what that amount of time is, because it doesn't affect the end result (fixed time spent in loop manipulation plus time spent in stack allocation is X; the same fixed time spent in loop manipulation plus time spent in heap allocation is Y; X < Y). Heap allocation is still much worse.


I'll concede I misunderstood the comment about calibrating the loops. I'm leaving that portion so that anybody concerned about the things I had mentioned will have answers.

Max Lybbert
Your example is flawed: any decent compiler is going to take out the "empty e" line from the 3rd line (main sub) in the common sub-expression elimination and dead-code elimination phases. Better do that within a function and call it from the loop. Probably you should do the same for stack.
Joe Pineda
Also, you should add a "calibration" loop at the very beginning of your main function, something to give you an idea how much time per loop-cycle you're getting, and adjust the other loops so as to ensure your example runs for some amount of time, instead of the fixed constant you're using.
Joe Pineda
Calibration loop: if it runs for too short, your results won't be statistically significant. When you moved it to 10 million, you got significant results. However, on my old 386 the original 100K would have been more than adequate for the task, hence the need for the calibration loop.
Joe Pineda
I'm also glad increasing the number of times each option loop runs (plus instructing g++ not to optimize?) has yielded significant results. So now we have hard facts to say stack is faster. Thanks for your efforts!
Joe Pineda
What you are claiming is *wrong*. The compiler won't emit any code to grow the stack for your empty class. I compiled your source without optimization under VS 2005 and disassembled the exe with IDA - no stack allocation at all.
jn_
What I am claiming is (1) a test program is trivial, and (2) GCC has these performance characteristics. The C++ standard requires that objects take up at least one byte of space; if Microsoft isn't following the standard I'm sure you can figure out how to make the object take up space.
Max Lybbert
The answer also said "To be on the safe side, it would be possible to add a field to empty ... and access that field. ... Taking the address of e would also work, and 'should' take the same amount of time in both loops." I find it odd you disassembled an exe but didn't read the whole answer.
Max Lybbert
Well it's the same for GCC at least with optimization turned on - no stack allocation even with additional fields
jn_
That's why I turned optimization off. I'm not trying to find the best timing results ever. I'm not going to publish the results as part of a doctoral thesis. I simply wanted to demonstrate that malloc is hundreds of times slower than stack allocation.
Max Lybbert
It's the optimizer's job to get rid of code like this. Is there a good reason to turn the optimizer on and then prevent it from actually optimizing? I've edited the answer to make things even clearer: if you enjoy fighting the optimizer, be prepared to learn how smart compiler writers are.
Max Lybbert
You are only allocating the object on the stack once. To allocate 10000 objects, you need to do empty array[10000];
erikkallen
+2  A: 

It's not just stack allocation that's faster. You also win a lot by using stack variables: they have better locality of reference, and deallocation is a lot cheaper too.

MSalters
A: 

There's a general point to be made about such optimizations.

The optimization you get is proportional to the amount of time the program counter is actually in that code.

If you sample the program counter, you will find out where it spends its time, and that is usually in a tiny part of the code, and often in library routines you have no control over.

Only if you find it spending much time in the heap-allocation of your objects will it be noticeably faster to stack-allocate them.

Mike Dunlavey
A: 

Never make premature assumptions; other application code and usage can impact your function, so looking at a function in isolation is of no use.

If you are serious about the application, then profile it with VTune or a similar profiling tool and look at the hotspots.

Ketan
A: 

It has been mentioned before that stack allocation is simply moving the stack pointer, that is, a single instruction on most architectures. Compare that to what generally happens in the case of heap allocation.

The operating system maintains portions of free memory as a linked list, with the payload data consisting of the pointer to the starting address of the free portion and the size of the free portion. To allocate X bytes of memory, the linked list is traversed and each node is visited in sequence, checking whether its size is at least X. When a portion with size P >= X is found, P is split into two parts with sizes X and P-X. The linked list is updated and the pointer to the first part is returned.

As you can see, heap allocation depends on many factors, like how much memory you are requesting, how fragmented the memory is, and so on.

Nikhil
A: 

In general, stack allocation is faster than heap allocation as mentioned by almost every answer above. A stack push or pop is O(1), whereas allocating or freeing from a heap could require a walk of previous allocations. However you shouldn't usually be allocating in tight, performance-intensive loops, so the choice will usually come down to other factors.

It might be good to make this distinction: you can use a "stack allocator" on the heap. Strictly speaking, I take stack allocation to mean the actual method of allocation rather than the location of the allocation. If you're allocating a lot of stuff on the actual program stack, that could be bad for a variety of reasons. On the other hand, using a stack method to allocate on the heap when possible is the best choice you can make for an allocation method.

Since you mentioned Metrowerks and PPC, I'm guessing you mean Wii. In this case memory is at a premium, and using a stack allocation method wherever possible guarantees that you don't waste memory on fragments. Of course, doing this requires a lot more care than "normal" heap allocation methods. It's wise to evaluate the tradeoffs for each situation.

Dan Olson
+8  A: 

An interesting thing I learned about Stack vs. Heap Allocation on the Xbox 360 Xenon processor, which may also apply to other multicore systems, is that allocating on the Heap causes a Critical Section to be entered to halt all other cores so that the alloc doesn't conflict. Thus, in a tight loop, Stack Allocation was the way to go for fixed sized arrays as it prevented stalls.

This may be another speedup to consider if you're coding for multicore/multiproc, in that your stack allocation will only be viewable by the core running your scoped function, and that will not affect any other cores/CPUs.

Furious Coder
That's true of most multicore machines, not just the Xenon. Even Cell has to do it because you might be running two hardware threads on that PPU core.
Crashworks
That's an effect of the (particularly poor) implementation of the heap allocator. Better heap allocators need not acquire a lock on every allocation.
Chris Dodd
+2  A: 

Aside from the orders-of-magnitude performance advantage over heap allocation, stack allocation is preferable for long running server applications. Even the best managed heaps eventually get so fragmented that application performance degrades.

Jay
A: 

Why not simply replace `empty e;` with something like `int j = i;`? That would make sure that stack allocation did take place.

sactiw
A: 

Stack allocation will almost always be as fast or faster than heap allocation, although it is certainly possible for a heap allocator to simply use a stack based allocation technique.

However, there are larger issues when dealing with the overall performance of stack vs. heap based allocation (or, in slightly better terms, local vs. external allocation). Usually, heap (external) allocation is slow because it deals with many different kinds of allocations and allocation patterns. Reducing the scope of the allocator you are using (making it local to the algorithm/code) will tend to increase performance without any major changes.

Adding better structure to your allocation patterns, for example forcing a LIFO ordering on allocation and deallocation pairs, can also improve your allocator's performance by using the allocator in a simpler and more structured way. Or you can use or write an allocator tuned for your particular allocation pattern; most programs allocate a few discrete sizes frequently, so a heap based on a lookaside buffer of a few fixed (preferably known) sizes will perform extremely well. Windows uses its Low Fragmentation Heap for this very reason.

On the other hand, stack-based allocation on a 32-bit memory range is also fraught with peril if you have too many threads. Stacks need a contiguous memory range, so the more threads you have, the more virtual address space you will need for them to run without a stack overflow. This won't be a problem (for now) with 64-bit, but it can certainly wreak havoc in long running programs with lots of threads. Running out of virtual address space due to fragmentation is always a pain to deal with.

MSN
A: 

Stack allocation is a couple of instructions, whereas the fastest RTOS heap allocator known to me (TLSF) uses on the order of 150 instructions on average. Stack allocations also don't require a lock, because each thread has its own stack, which is another huge performance win. So stack allocations can be 2-3 orders of magnitude faster, depending on how heavily multithreaded your environment is.

In general, heap allocation is your last resort if you care about performance. A viable in-between option is a fixed-size pool allocator, which is also only a couple of instructions and has very little per-allocation overhead, so it's great for small, fixed-size objects. On the downside, it only works with fixed-size objects, is not inherently thread-safe, and has block fragmentation problems.

Andrei Pokrovsky
A: 

Stack allocation is much faster.

Atul