ansaurus

Question

Resizing dynamic stack allocations in C++

Answer 1

A:

The C++ Standard provides no means of manipulating the stack - it doesn't even require that there be a stack. Have you actually measured the performance of your code using dynamic allocation?

anon 2009-06-02 09:11:37

I know... Which is probably why the hack I tried gave segfault after running for a while...And I need not a standardized solution, just a hack that works :)

jopsen 2009-06-02 09:14:28

I notice you didn't answer my question regarding dynamic allocation performance.

anon 2009-06-02 09:15:50

Just tried using std::list and FPS (frames per second) drops 50% compared to function recursion, which is about the same as a large static allocation on stack...

jopsen 2009-06-02 09:56:28

Answer 2

+2 A:

Stack allocations cannot be resized.

In your example, it isn't immediately obvious which data you need to allocate - besides the call stack itself. You could basically hold the current path in a vector preallocated to the maximum depth. The loop gets ugly, but that's life...

If you need many small allocations that can be released as a whole (after the algorithm completes), use a continuous pool for your allocations.

If you know an upper boundary for the required memory, the allocation is just a pointer increment:

class CPool
{
    std::vector<char> m_data;
    size_t m_head;
  public:
    CPool(size_t size) : m_data(size()), m_head(0) {}
    void * Alloc(size_t size)
    {
      if (m_data.size() - head < size)
        throw bad_alloc();
      void * result = &(m_data[m_head]);
      m_head += size;      
      return result;
    }
    void Free(void * p) {} // free is free ;)
};

If you don't have an upper boundary for the total size, use "pool on a rope" - i.e. when the big chunk of memory runs out, get a new one, and put these chunks in a list.

peterchen 2009-06-02 09:29:28

There is no data besides the two pointers, one of which shall be traveled in next iteration...I just don't want my CPU to do "call" and "return" instructions, as a simple goto statement could do the trick... if I could just store a pointer on the stack :)

jopsen 2009-06-02 10:05:20

As said, if you know the maximum depth, it is a single allocation per search. That shouldn't be to much.

peterchen 2009-06-02 11:29:50

Answer 3

A:

Hi Jonas,

Since alloca allocations are cummulative, I suggest you do a first alloca to store the "this" pointer, thus becoming the "base" of the stack, keep track of how many elements your stack can hold and allocate only the size needed:

inline void travel(param &p){

    BoundingBoxNode* stack = alloca(sizeof(BoundingBoxNode*)*3);
    int stack_size = 3, stack_idx = 0;
    stack[stk_idx] = this;

    do {
            BoundingBoxNode* current = stack[stk_idx];
            if( current->intersect(p)){
                    int need = current->left ? ( current->right ? 2 : 1 ) : 0;
                    if ( stack-size - stk_idx < need ) {
                         // Let us simplify and always allocate enough for two
                         alloca(sizeof(BoundingBoxNode*)*2);
                         stack_size += 2;
                    }
                    if(current->left){
                            stack[stack_idx++] = current->left;
                    }
                    if(current->right){
                            stack[stack_idx++] = current->right;
                    }
            }
            stack_idx--;
    } while(stack_idx > 0)
};

njsf 2009-06-02 09:34:01

Answer 4

A:

The fact that it works with small stack sizes is probably a coincidence. You'd have to maintain multiple stacks and copy between them. You're never guaranteed that successive calls to alloca will return the same address.

Best approach is probably a fixed size for the stack, with an assert to catch overflows. Or you could determine the max stack size from the tree depth on construction and dynamically allocate a stack that will be used for every traversal... assuming you're not traversing on multiple threads, at least.

Dan Olson 2009-06-02 09:35:54

Answer 5

+4 A:

So, you've got a recursive function that you want to convert to a loop. You correctly work out that your function is not tail call so you have to implement it with a stack.

Now, why are you worried about the number of times that you allocate your "scratch space" stack? Is this not done once per traversal? -- if not then pass the scratch area in to the traverse function itself so it can be allocated once and then re-used for each traversal.

If the stack is small enough to fit in the cache it will stay hot and the fact that it isn't on the real C++ stack won't matter.

Once you've done all of that profile it both ways and see if it made any difference -- keep the faster version.

KayEss 2009-06-02 09:39:06

Answer 6

+1 A:

You don't need the stack, you just need a stack. You can probably use a std::stack<BoundingBoxNode* >, if I look at your code.

MSalters 2009-06-02 10:01:20

I can but std::stack allocates it memory on the heap, which is expensive... Did try using std::list and it eats 50% performance...

jopsen 2009-06-02 10:07:20

note that std:;stack, which uses a deque as its underlying storage, is probably more quite a bit more efficient than std:;list

anon 2009-06-02 10:27:33

@jopsen: Don't believe the heap stories, nor dismiss them out of hand. You would have a problem if you have one heap allocation per traversal (O(N)). std::stack will likely need O(log(stackdepth)) which for random trees is O(log(N)), so your total amount of allocations is O(log(log(N)) - quite possibly 5.

MSalters 2009-06-02 10:48:12

Answer 7

A:

From your question, it appears there is a lot that still needs to be learned.

The most important thing to learn is: don't assume anything about performance without first measuring your runtime execution and analysing the results to determine exactly where the bottlenecks to performance are.

The function 'alloca' allocates a chunk of memory from the stack, the stack size is increased (by moving the stack pointer). Each call to 'alloca' creates a new chunk of memory until you run out of stack space, it does not re-use previously allocated memory, the data that was pointed to by 'stack' is lost when you allocate another chunk of memory and assign it to 'stack'. This is a memory leak. In this case, the memory is automatically freed when the function exits so it's not a serious leak, but, you've lost the data.

I would leave the "Optimization Reference Manual for IA-32" well alone. It assumes you know exactly how the CPU works. Let the compiler worry about optimisations it will do a good enough job for what you're doing - the compiler writers hopefully know that reference inside out. With modern PC's, the common bottleneck to performance is usually memory bandwidth.

I believe the '16 deep' function calls being expensive is to do with how the CPU is managing its stack and is a guideline only. The CPU keeps the top of the stack in onboard cache, when the cache is full, the bottom of the stack is paged to RAM which is where the performance starts to decrease. Functions with lots of arguments won't nest as deeply as functions with no arguments. And it's not just function calls, it's also local variables and memory allocated using alloca. In fact, using alloca is probably a performance hit since the CPU will be designed to optimise its stack for common use cases - a few parameters and a few local variables. Stray from the common case and performance drops off.

Try using std::stack as MSalters has suggested above. Get that working.

Skizz 2009-06-02 10:23:29

std::stack does heap allocations, then recursive calls will be faster... std::stack uses some sort of linked list, and considering that I'm only storing pointers I'll be allocating twice the memory I need...But maybe you're right, perhaps I should just leave it to the compiler... But it shouldn't be impossible to manipulate the stack manually...

jopsen 2009-06-02 10:41:38

Looking at the code, your bottleneck is more likely to be in the intersect method. But you must profile the code to make sure. I don't think the recursive version will cause a performance problem since pushing/popping the CPU stack is probably quicker than using dynamic storage (such as std::stack). It will also be easier to understand. Stack manipulation is limited to alloca because the language standard doesn't dictate how the CPU stack works - different CPUs implment stacks in different ways.

Skizz 2009-06-02 11:05:25

Theoretically pushing/popping from the CPU stack should be possible in a while loop... I'm well aware that the language does not offer this... But I was hoping there was some hack, assembler or other thing that could be used to achieve this...From what I read on most of the other replies I probably did formulate my problem clearly enough... :)

jopsen 2009-06-02 20:30:14

Answer 8

A:

Use a C++ data structure. You are using C++ after all. A std::vector<> can be pre-allocated in chunks for an amortized cost of pretty much nil. And it's safe too (as you have noticed that using the normal stack is not. Especially when you're using threads)

And no, it's not expensive. It's as fast as a stack allocation.

std::list<> yes, that will be expensive. But that's because you can't pre-allocate that. std::vector<> is chunk-allocated by default.

Thomas 2009-06-02 12:55:19

ansaurus

tags:

views:

answers:

Resizing dynamic stack allocations in C++

related questions