views:

1041

answers:

10

This question has been mystifying me for years, and considering this site's name, this is the place to ask it.

Why do we, programmers, still have this StackOverflow problem?

Why in every major language does the thread stack memory have to be statically allocated on thread creation?

I will speak in the context of C#/Java, because I use them most, but this is probably a broader problem.

Fixed stack size leads to huge problems:

  • There is no way to write a recursive algorithm unless you are absolutely sure that the depth of recursion is tiny. Linear memory complexity of the recursive algorithm is often unacceptable.
  • There is no cheap way to start new threads. You have to allocate a huge block of memory for the stack to account for all possible uses of the thread.
  • Even if you don't use very deep recursion, you always risk running out of stack space, simply because the stack size is an arbitrary fixed number. Considering that a stack overflow is usually unrecoverable, this is a big problem in my eyes.

Now, if the stack were resized dynamically, all of the problems above would be much alleviated, because a stack overflow would only be possible when memory itself overflows.

But this is not the case yet. Why? Are there some fundamental limitations of modern CPUs which would make it impossible/inefficient? If you think about the performance hit that reallocations would impose, it should be acceptable because people use structures like ArrayList all the time without suffering much.

So, the question is: am I missing something and stack overflow is not a problem, or am I missing something and there are a lot of languages with dynamic stacks, or is there some big reason for this being impossible/hard to implement?

Edit: Some people said that performance would be a large problem, but consider this:

  • We leave the compiled code untouched. The stack access stays the same, thus the "usual case" performance stays the same.
  • We handle the CPU exception that occurs when the code tries to access unallocated memory, and launch our "reallocation" routine. Reallocations won't be frequent because <put your usual ArrayList argument here>. This should work on most protected-mode CPUs without loss of performance. No?
+16  A: 

1) In order to resize stacks, you have to be able to move memory around, meaning that pointers to anything on a stack can become invalid after a stack resize. Yes, you can use another level of indirection to solve this problem, but remember that the stack is used very, very frequently.

2) It makes things significantly more complicated. Push/pop operations on stacks usually work simply by doing some pointer arithmetic on a CPU register. That's why allocation on the stack is faster than allocation on the free store.

3) Some CPUs (microcontrollers in particular) implement the stack directly on hardware, separate from the main memory.

Also, you can set the stack size of a thread when you create it using beginthread(), so if you find that the extra stack space is unnecessary, you can set the stack size accordingly.

From my experience, stack overflows are usually caused by infinite recursions or recursive functions that allocate huge arrays on the stack. According to MSDN, the default stack size set by the linker is 1MB (the header of executable files can set their own default), which seems to be more than big enough for a majority of cases.

The fixed-stack mechanism works well enough for the majority of applications, so there's no real need to change it. If it doesn't work for you, you can always roll your own stack.

In silico
3) it's bad practice to have static data in functions
fazo
Hmm, 1) is a valid argument. On platforms not limited with address space (read x64) it could be solved easily by leaving large unallocated address-space blocks for every thread, but for 32-bit you would indeed need to update pointers. However, this should not be a problem for managed languages. I am not sure about 2) because you probably could still do your pointer arithmetic and allocate additional memory when you encounter a segfault.
Rotsor
@fazo, an int is static data. Do you not have any ints in functions?
Rotsor
@Rotsor, I meant static tables in, for example, recursive functions
fazo
I think 1) still holds, because you're still limited on memory, and you can't step over other programs that are running. And for 2), the point I was trying to get across is that it adds complexity, and we already know how complicated computers are.
In silico
Another big reason is tail recursion. In languages with proper tail recursion, you can recurse without using extra stack everywhere you could use a loop. And if I recall correctly, tail recursion allows you to express exactly the same class of programs without using extra space.
Antal S-Z
(1) is totally wrong. We don't need contiguous stack space. It's no problem to allocate a new stack segment via malloc when the current local variables would overflow. There is no need for a resize operation, and for (2) it could be part of the CPU architecture.
Lothar
You'd think that, in modern non-embedded architectures, (1) could be handled by the memory manager (a level of indirection, true, but hardware-based and very fast). Simply allocate it a large memory space to begin with. Assuming you don't need the address range for anything else in that process, the system allocates physical memory as needed.
David Thornley
+20  A: 

I've never personally encountered a stack overflow that wasn't caused by infinite recursion. In these cases, a dynamic stack size wouldn't help, it would just take a little longer to run out of memory.

Tesserex
Moreover, think about a recursive loop with infinite memory: say bye-bye to your memory
fazo
You are right, 99.9999999% of all stack overflow errors are because of some poorly written recursive function. Erlang has proper tail recursion and no problems with the stack, since no state has to be saved on the stack.
fuzzy lollipop
I did. In C++, try adding the variable "char buffer[1024*1024*10]" in any function.
SigTerm
I had a number of cases when I thought "wow, I have a nice recursive algorithm", and then thought "and what is the maximum input data size? Oh no, stack-limited again :(", and was forced to write an iterative algorithm instead of a recursive one.
Rotsor
It's not too hard to run into a stack overflow due to lots of large temporary objects: see http://blogs.msdn.com/b/oldnewthing/archive/2006/03/29/563939.aspx for example.
Adam Rosenfield
@Rotsor: this is Dynamic Programming, which maps the recursive algorithm to an equivalent iterative one. As an added bonus, the iterative version will be more efficient in both run time and memory consumption.
Borealid
@Borealid: You are wrong. That is not Dynamic Programming. Dynamic Programming is the technique of converting a problem with optimal substructure into an algorithm, which is defined in a recursive way, and I like implementing it recursively.
Rotsor
A: 

Any code that would cause a stack overflow on a typical static-length stack is wrong anyway.

  • You could make the stack a std::vector-like object, but you'd have extremely unpredictable performance when it decided to resize -- and anyway, it would most likely just keep doing it until all the heap was exhausted too, and that's more annoying.
  • You could make it like a std::list, where it grows in O(1). However, the pointer arithmetic used on a static stack is so totally critical in every way to program performance that it would be uselessly slow. Languages were designed around one return value and arbitrary numbers of input parameters because that's what fits the static stack/pointer arithmetic paradigm.

So a dynamically resizable stack would be A) a performance nightmare and B) of no value anyway, since your stack shouldn't have gotten that deep.

Scott Stafford
See my edit in response to "performance nightmare". Additionally, if one needs predictable performance, he would be free to pre-allocate the stack in advance, but this would never be needed in practice.
Rotsor
+5  A: 

Stacks are resized dynamically - or, to be precise, grown dynamically. You get an overflow when a stack cannot grow any further, which is not to say it exhausted the address space, but rather that it grew to conflict with a portion of memory used for other purposes (e.g., a process heap).

Maybe you mean that stacks cannot be moved dynamically? The root of that is probably that stacks are intimately coupled to the hardware. CPUs have registers and piles of logic dedicated to thread stack management (esp, ebp, and the call/return/enter/leave instructions on x86). If your language is compiled (or even JITted), you're bound to the hardware mechanism and cannot move stacks around.

This hardware 'limitation' is probably here to stay. Re-basing a thread stack during thread execution seems far from a reasonable demand of a hardware platform (and the added complexity would badly hamper all executed code on such an imaginary CPU, even compiled code). One can picture a completely virtualized environment where this limitation does not hold, but since such code couldn't be JITted, it would be unbearably slow. Not a chance you could do anything interactive with it.

Ofek Shilon
Are you saying that if I create a thread having some gigabytes of stack, that would not eat up all my memory instantly? I think that is wrong (at least for C# and Java).
Rotsor
Sane runtime libraries do not commit the entire stack space; on 64-bit you can reserve gigabytes of stack space and commit only a few hundred kilobytes. To grow the stack, you simply commit a few more pages from the reserved area.
wj32
Even if you could get the hardware to cooperate with rebasing a stack, you'd still have the problem that all pointers would need to be rebased as well. In low-level languages like C and C++, you just can't move memory around at will, because you have no idea who has a pointer to it. Not even scanning the entire address space for potential pointers would work, since you could run into accidental false positives.
Adam Rosenfield
You might gain *some* stack space, but not much, from pre-allocating a larger stack. First, if you pre-allocate some gigabytes you'd risk not leaving enough space for other mandatory memory consumers (heaps, binary images, shared pages, other thread stacks, etc.), and second, by default stacks have quite enough growth space to start with. If you have a specific scenario that mandates moderately large stacks, you might try it, though.
Ofek Shilon
@Adam Rosenfield, yes, you are right of course. @wj32, this is news to me, because the runtime I tested with did not have this feature.
Rotsor
Well I said "sane runtime libraries" for a reason. IIRC .NET actually commits the entire 1MB of stack space.
wj32
@Rotsor: There's a difference between reserving address space and reserving memory. @Ofek: Most RISC architectures have no explicit stack support. You never move a stack simply because doing so would be inefficient.
Artelius
+1  A: 

Why in every major language does the thread stack memory have to be statically allocated on thread creation?

Stack size and allocation is not necessarily related to the language you are using. It is more a question of processor and architecture.

Stack Segments are limited to 4GB on current Intel processors.

The following link is a good read that may give you some of the answers you seek.

http://www.intel.com/Assets/PDF/manual/253665.pdf - Chapter 6.2

+3  A: 

Why do we, programmers, still have this StackOverflow problem?

A fixed-size stack is easy to implement and is acceptable for 99% of programs. "Stack overflow" is a minor problem that is somewhat rare. So there is no real reason to change things. Also, it is not a language problem; it is more related to platform/processor design, so you'll have to deal with it.

There is no way to write a recursive algorithm unless you are absolutely sure that the depth of recursion is tiny. Linear memory complexity of the recursive algorithm is often unacceptable.

Now this is incorrect. In a recursive algorithm you can (almost?) always replace the actual recursive call with some kind of container (a list, std::vector, stack, array, FIFO queue, etc.) that will act like the stack. The calculation will "pop" arguments from the end of the container and push new arguments onto either the end or the beginning of the container. Normally, the only limit on the size of such a container is the total amount of RAM.

Here is a crude C++ example:

#include <deque>
#include <iostream>

size_t fac(size_t arg){
    if (arg < 2)
        return 1; // 0! == 1! == 1
    std::deque<size_t> v; // the container plays the role of the call stack
    v.push_back(arg);
    while (v.back() > 2)
        v.push_back(v.back() - 1); // "recursive calls": push the arguments
    size_t result = 1;
    for (size_t i = 0; i < v.size(); i++) // "returns": combine the results
        result *= v[i];
    return result;
}

int main(int argc, char** argv){
    int arg = 12;
    std::cout << " fac of " << arg << " is " << fac(arg) << std::endl;
    return 0;
}

Less elegant than recursion, but no stack overflow problem. Technically, we're "emulating" recursion in this case. You can think of stack overflow as a hardware limitation you have to deal with.

SigTerm
+1. And I don't see how using the heap instead of the stack for recursion is any less limiting, since you can run out of contiguous address space at any time. Both methods have a theoretical limit.
wj32
Recursion unrolling is useful to work around such problems, but it doesn't make an algorithm without recursion recursive.
Rotsor
Yes, you can always avoid using recursive stack calls by simulating a stack using an array and lots of extra hand coding. How does that help? Now the problem is your fixed-size array, if it can't be moved, runs out of space by running into some neighboring data structure, and you had to code the problem awkwardly to boot.
Ira Baxter
**@Ira Baxter**, well, the array is not fixed-size. As you can see from the code, **SigTerm** used `std::deque` which is resized dynamically as needed, so the problem is eliminated. Awkwardness still stays, though...
Rotsor
@Rotsor: Resized how? By copying the whole thing? OK, you array has 250Mb in it and finally overflows. Now it has to be recopied. Ooops, page fault city. (You already paid for page faults in filling up the first 250Mb, but now you're doing it again). Hows this going to affect your performance? And if you really want to model "stack" allocation with your array, you have to consider storing "local variables" into your array. What are you going to do if somebody forms a pointer to a "local variable" in the array and then the array moves?
Ira Baxter
@Ira Baxter: I think in C++ they just don't usually use raw pointers, so array moving is not a problem. You could use some more advanced structure (e.g. list of 4-kb arrays) to reduce performance penalty of resizing, but why do we care if this algorithm is not recursive in the first place?
Rotsor
@Rotsor: C++ not use raw pointers? Where'd you get that? C++'s pointers are just like those in C, and those pointers are definitely "raw". If you pick fixed-size chunks like 4K, you'll be sorry when your local variable is 5KB. ... And if you are going to hand-code 4KB-sized chunks to avoid moving things, along with all the overhead, why not just switch to heap-allocated records? My point: what is discussed in this thread just isn't a practical solution.
Ira Baxter
@Ira Baxter: std::deque is a container optimized for push_back/pop_back operations and indexed access. It is NOT a linear array and it is NOT guaranteed to be a linear block of memory. It isn't std::vector. Try to subtract two deque iterators, and you'll see a difference. See reference http://www.sgi.com/tech/stl/Deque.html http://en.wikipedia.org/wiki/Double-ended_queue
SigTerm
@Ira Baxter: "Resized how? By copying the whole thing? OK, your array has 250Mb" - std::deque is not an array; you're confusing it with std::vector. How exactly it resizes itself is implementation-dependent. Before arguing about the cost of resizing, I suggest you become more familiar with the C++ STL containers. http://www.cplusplus.com/reference/stl/
SigTerm
@Rotsor: "I think in C++ they just don't usually use raw pointers" - a good C++ programmer uses pointers when they are necessary. Moving an array has a cost, but a deque is not an "array". It is an STL container with an array-like element access interface.
SigTerm
@Rotsor: No, there isn't all that much in the way of `Foo *` in modern C++ programs, but the replacements are frequently just wrappers around raw pointers. Iterators are the replacement for pointers in standard containers, but you will find rules on when iterators might be invalidated. Smart pointers are typically raw pointers that handle their own lifetimes. (However, adding to the beginning or end of a `std::deque` doesn't invalidate iterators or pointers to members inside the deque.)
David Thornley
+6  A: 

I can't speak for "major languages". Many "minor" languages use heap-allocated activation records, with each call using a chunk of heap space instead of a linear stack chunk. This allows recursion to go as deep as you have address space to allocate.

Some folks here claim that recursion that deep is wrong, and that using a "big linear stack" is just fine. That isn't right. I'd agree that if you have to use the entire address space, you do have a problem of some kind. However, when one has very large graph or tree structures, you want to allow deep recursion, and you don't want to guess at how much linear stack space you need first, because you'll guess wrong.

If you decide to go parallel, and you have lots (thousands to millions) of "grains" [think: small threads], you can't have 10Mb of stack space allocated to each thread, because you'll be wasting gigabytes of RAM. How on earth could you ever have a million grains? Easy: lots of grains that interlock with one another; when a grain is frozen waiting for a lock, you can't get rid of it, and yet you still want to run other grains to use your available CPUs. This maximizes the amount of available work and thus allows many physical processors to be used effectively.

The PARLANSE parallel programming language uses this very-large-number of parallel grains model, and heap allocation on function calls. We designed PARLANSE to enable the symbolic analysis and transformation of very large source computer programs (say, several million lines of code). These produce... giant abstract syntax trees, giant control/data flow graphs, giant symbol tables, with tens of millions of nodes. Lots of opportunity for parallel workers.

The heap allocation allows PARLANSE programs to be lexically scoped, even across parallelism boundaries, because one can implement "the stack" as a cactus stack, where forks occur in "the stack" for subgrains, and each grain can consequently see the activation records (parent scopes) of its callers. This makes passing big data structures cheap when recursing; you just reference them lexically.

One might think that heap allocation slows down the program. It does; PARLANSE pays about a 5% penalty in performance but gains the ability to process very large structures in parallel, with as many grains as the address space can hold.

Ira Baxter
Thank you for explaining the problem better than me! And this micro-grained approach looks impressive. A mere 5% performance price is a little bit unbelievable, though. Maybe that's because I don't completely understand what a "grain" is (I thought of it as a single method call), but anyway, great!
Rotsor
If you code the trivial 2-line Fibonacci function, PARLANSE's additional call overhead is rather more visible. Most functions do lots more work than just the function call and return, so the relatively high overhead compared to a plain CALL instruction isn't really that bad.
Ira Baxter
... Grains aren't method calls; they are PARLANSE's equivalent of threads. Neither Windows nor Linux will let you have a million OS threads, so PARLANSE implements grains by multiplexing OS threads onto the ready-to-run grains, in much the same way that the OS multiplexes physical CPUs onto OS threads. We happen to allocate one OS thread per physical CPU, and on an otherwise idle machine, PARLANSE has all the physical processors to itself, one per thread.
Ira Baxter
OK, I see now. Grains are larger than I thought. That explains the low penalty.
Rotsor
... Grains don't cause the low penalty. Large function bodies mean the ratio of work done by a function to the work required to create/delete activation records is typically modest to small. It takes an x86 typically 1-2 machine instructions to execute a traditional CALL instruction. It takes PARLANSE 4 machine instructions to do a "heap-allocated" call. It just isn't that much extra overhead. (One can do better: tail recursion can lower the cost to zero, and we've considered reasonable techniques to lower the overhead to almost that of a conventional CALL instruction in many cases.)
Ira Baxter
+2  A: 

Having practically infinite stack space would be very bad in the case of an infinite recursion, because it would turn an easily diagnosed error (stack overflow) into a much more problematic one (out of memory). With a stack overflow, a look at the stack trace will fairly quickly tell you what is going on. When the system is out of memory, on the other hand, it may attempt other methods of solving the problem, such as using swap space, resulting in serious performance degradation.

On the other hand, I have rarely had issues with hitting the stack overflow barrier due to recursion. However, I can think of a couple of circumstances where it happened. In those cases, moving to my own stack implemented as a std::vector was a simple solution to the problem.

Now, what would be neat is if the language would allow me to mark a particular function as "heavily recursive", and then have it operate in its own stack space. That way I'd generally get the advantage of stopping when my recursion is out of whack, but I could still make use of extensive recursion when I wanted to.

Winston Ewert
+1 for catching errors early.
Justin K
+1  A: 

I think we will see this restriction removed in a few years.

There is simply no fundamental technical reason for fixed-size stacks. They exist for historical reasons and because the programmers of compilers and VMs are lazy and don't optimize if it is good enough right now.

But Go, the Google language, already starts with a different approach. It allocates the stack in small 4K pieces. There are also many "stackless" programming language extensions, like Stackless Python, that do the same.

The reason for this is quite simple: the more threads you have, the more address space is wasted. For programs which are slower with 64-bit pointers it is a serious problem. You can't really have more than a few hundred threads in practice. This is not good if you write a server which might want to serve 60,000 clients with a thread for each one (wait for the 100-core/CPU systems in the near future).

On 64-bit systems it's not so serious, but it still requires more resources. For example, TLB entries for pages are extremely important for good performance. If you can satisfy 4,000 normal thread stacks with one single TLB entry (given a page size of 16MB and 4KB of active stack space per thread), you can see the difference. Don't waste 1020KB just for stack that you almost never use.

Fine-grained multithreading will be a very, very important technique in the future.

Lothar
+2  A: 

I am going to summarize the arguments in the answers so far, because I find no single answer that covers this topic well enough.

Static stack investigation

Motivation

Not everyone needs it.

  • Most algorithms do not use deep recursion or a lot of threads, thus not a lot of people need dynamic stacks.
  • A dynamic stack would make an infinite-recursion stack overflow, which is an easy mistake to make, harder to diagnose. (A memory overflow, while being as deadly as a stack overflow to the current process, is hazardous for other processes as well.)
  • Every recursive algorithm can be emulated with a similar iterative one.

Implementation difficulties

Dynamic stack implementation turns out to be not as straightforward as it seems.

  • Stack resizing alone is not enough unless you have unlimited address space. You will sometimes need to relocate the stack as well.
  • Stack relocation would require updates for all the pointers to the data structures allocated on the stack. While it is straightforward (at least in managed languages) for the data in memory, there is no easy way to do the same for data in the CPU registers of the thread.
  • Some CPUs (microcontrollers in particular) implement the stack directly on hardware, separate from the main memory.

Existing implementations

There are some languages or runtime libraries that already have the dynamic stack feature or something similar to it.

  • Some runtime libraries (which?) do not pre-commit the entire block of memory allocated for the stack. This can alleviate the problem, especially for 64-bit systems, but not completely eliminate it.
  • Ira Baxter told us about PARLANSE, a language specifically designed for dealing with complex data structures with high degree of parallelism. It uses small heap-allocated "grains" of work instead of stack.
  • fuzzy lollipop told us that "Properly written Erlang doesn't have stack overflows!"
  • The Google Go programming language is said to have a dynamic stack. (A link would be nice.)

I would like to see more examples here.

I hope I didn't forget any important pieces of information on this subject. Making this a community wiki so that anyone can add new information.

Rotsor