
Hi,

I'm writing a small ray tracer using bounding volume hierarchies to accelerate ray tracing. Long story short, I have a binary tree and I might need to visit multiple leaves.

Currently I have a node with two children, left and right; during travel(), if some condition holds, in this example intersect(), the children are visited:

class BoundingBoxNode{
    BoundingBoxNode* left, *right;
    void travel(param &p);
    inline bool intersect(param &p){...};
};

void BoundingBoxNode::travel(param &p){
    if(this->intersect(p)){
        if(left)
            left->travel(p);
        if(right)
            right->travel(p);
    }
}

This approach uses recursive method calls; however, I need to optimize this code as much as possible... And according to the Optimization Reference Manual for IA-32, function calls deeper than 16 can be very expensive, so I would like to do this using a while loop instead of recursive calls. But I do NOT wish to do dynamic heap allocations, as these are expensive. So I was thinking that maybe I could use the fact that every time a while loop starts over, the stack will be in the same position. In the following very ugly hack I rely on alloca() always allocating the same address:

class BoundingBoxNode{
    BoundingBoxNode *left, *right;
    inline void travel(param &p){
        int stack_size = 0;
        BoundingBoxNode* current = this;
        while(stack_size >= 0){
            BoundingBoxNode** stack = (BoundingBoxNode**) alloca(stack_size * sizeof(BoundingBoxNode*) + 2 * sizeof(BoundingBoxNode*));
            if(current->intersect(p)){
                if(current->left){
                    stack[stack_size] = current->left;
                    stack_size++;
                }
                if(current->right){
                    stack[stack_size] = current->right;
                    stack_size++;
                }
            }
            stack_size--;
            current = stack[stack_size];
        }
    };
    inline bool intersect(param &p){...};
};

However surprising it may seem, this approach does fail :) But it does work as long as the stack is smaller than 4 or 5 entries... I'm still quite confident that this approach is possible; I just really think I need some help implementing it correctly.

So how can I manipulate the stack manually from C++? Is it possible that I can use some compiler extension? Or must I do this in assembler, and if so, how do I write assembler that can be compiled with both GCC and ICC?

I hope somebody can help me... I don't need a perfect solution, just a hack, if it works it's good enough for this purpose :)

Regards Jonas Finnemann Jensen

A: 

The C++ Standard provides no means of manipulating the stack - it doesn't even require that there be a stack. Have you actually measured the performance of your code using dynamic allocation?

anon
I know... Which is probably why the hack I tried gave a segfault after running for a while... And I don't need a standardized solution, just a hack that works :)
jopsen
I notice you didn't answer my question regarding dynamic allocation performance.
anon
Just tried using std::list, and FPS (frames per second) drops 50% compared to function recursion, which is about the same as a large static allocation on the stack...
jopsen
+2  A: 

Stack allocations cannot be resized.

In your example, it isn't immediately obvious which data you need to allocate - besides the call stack itself. You could basically hold the current path in a vector preallocated to the maximum depth. The loop gets ugly, but that's life...
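A minimal sketch of that preallocated-vector idea (the `Node` struct and the `hit` flag standing in for intersect() are my stand-ins, not the asker's types):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical node; `hit` stands in for intersect(p).
struct Node {
    Node *left = nullptr, *right = nullptr;
    bool hit = false;
};

// Iterative traversal; `stack` is passed in and reused across calls,
// so the heap allocation happens once, not once per traversal.
// Returns the number of nodes visited.
inline int travel(Node *root, std::vector<Node*> &stack) {
    int visited = 0;
    stack.clear();                  // keeps capacity, no reallocation
    stack.push_back(root);
    while (!stack.empty()) {
        Node *cur = stack.back();
        stack.pop_back();
        ++visited;
        if (cur->hit) {
            if (cur->left)  stack.push_back(cur->left);
            if (cur->right) stack.push_back(cur->right);
        }
    }
    return visited;
}
```

Calling reserve() up front with a bound derived from the tree depth keeps the hot loop away from the allocator entirely.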

If you need many small allocations that can be released as a whole (after the algorithm completes), use a continuous pool for your allocations.

If you know an upper boundary for the required memory, the allocation is just a pointer increment:

class CPool
{
    std::vector<char> m_data;
    size_t m_head;
  public:
    CPool(size_t size) : m_data(size), m_head(0) {}
    void * Alloc(size_t size)
    {
      if (m_data.size() - m_head < size)
        throw std::bad_alloc();
      void * result = &(m_data[m_head]);
      m_head += size;
      return result;
    }
    void Free(void * p) {} // free is free ;)
};

If you don't have an upper boundary for the total size, use "pool on a rope" - i.e. when the big chunk of memory runs out, get a new one, and put these chunks in a list.
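A sketch of that "pool on a rope" (the class and member names here are mine, not part of the answer): when the newest chunk runs out, grab another one and keep the old chunks on a list so pointers handed out earlier stay valid.

```cpp
#include <cstddef>
#include <list>
#include <vector>

class ChunkPool
{
    std::list<std::vector<char>> m_chunks;
    std::size_t m_head;       // next free byte in the newest chunk
    std::size_t m_chunkSize;  // default size for new chunks
  public:
    explicit ChunkPool(std::size_t chunkSize)
      : m_head(0), m_chunkSize(chunkSize)
    {
        m_chunks.emplace_back(m_chunkSize);
    }
    void * Alloc(std::size_t size)
    {
        if (m_chunks.back().size() - m_head < size)
        {   // newest chunk is full: start another one
            m_chunks.emplace_back(size > m_chunkSize ? size : m_chunkSize);
            m_head = 0;
        }
        void * result = &(m_chunks.back()[m_head]);
        m_head += size;
        return result;
    }
    void Reset()  // release everything as a whole, keep one chunk warm
    {
        m_chunks.resize(1);
        m_head = 0;
    }
};
```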

peterchen
There is no data besides the two pointers, one of which shall be traveled in the next iteration... I just don't want my CPU to do "call" and "return" instructions, as a simple goto statement could do the trick... if I could just store a pointer on the stack :)
jopsen
As said, if you know the maximum depth, it is a single allocation per search. That shouldn't be too much.
peterchen
A: 

Hi Jonas,

Since alloca allocations are cumulative, I suggest you do a first alloca to store the "this" pointer, which thus becomes the "base" of the stack; keep track of how many elements your stack can hold and allocate only the size needed:

inline void travel(param &p){

    BoundingBoxNode** stack = (BoundingBoxNode**) alloca(sizeof(BoundingBoxNode*)*3);
    int stack_size = 3, stack_idx = 0;
    stack[stack_idx] = this;

    do {
            BoundingBoxNode* current = stack[stack_idx];
            if( current->intersect(p)){
                    int need = current->left ? ( current->right ? 2 : 1 ) : 0;
                    if ( stack_size - stack_idx < need ) {
                         // Let us simplify and always allocate enough for two
                         alloca(sizeof(BoundingBoxNode*)*2);
                         stack_size += 2;
                    }
                    if(current->left){
                            stack[stack_idx++] = current->left;
                    }
                    if(current->right){
                            stack[stack_idx++] = current->right;
                    }
            }
            stack_idx--;
    } while(stack_idx >= 0);
}
njsf
A: 

The fact that it works with small stack sizes is probably a coincidence. You'd have to maintain multiple stacks and copy between them. You're never guaranteed that successive calls to alloca will return the same address.

Best approach is probably a fixed size for the stack, with an assert to catch overflows. Or you could determine the max stack size from the tree depth on construction and dynamically allocate a stack that will be used for every traversal... assuming you're not traversing on multiple threads, at least.
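A sketch of the fixed-size-plus-assert approach (the `Node` struct, the `hit` flag standing in for intersect(), and the bound `kMaxStack` are my assumptions, not from the answer):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node; `hit` stands in for intersect(p).
struct Node {
    Node *left = nullptr, *right = nullptr;
    bool hit = false;
};

// Assumed bound: with this push order a binary tree of depth d needs at
// most d+1 slots, so derive kMaxStack from the tree built on construction.
const std::size_t kMaxStack = 64;

inline int travel(Node *root) {
    Node *stack[kMaxStack];   // lives on the C++ stack, no heap traffic
    std::size_t top = 0;
    stack[top++] = root;
    int visited = 0;
    while (top > 0) {
        Node *cur = stack[--top];
        ++visited;
        if (cur->hit) {
            if (cur->left)  { assert(top < kMaxStack); stack[top++] = cur->left; }
            if (cur->right) { assert(top < kMaxStack); stack[top++] = cur->right; }
        }
    }
    return visited;
}
```

The asserts catch overflow in debug builds and compile away in release builds.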

Dan Olson
+4  A: 

So, you've got a recursive function that you want to convert to a loop. You correctly work out that your function is not a tail call, so you have to implement it with a stack.

Now, why are you worried about the number of times that you allocate your "scratch space" stack? Is this not done once per traversal? If not, pass the scratch area in to the traverse function itself so it can be allocated once and then re-used for each traversal.

If the stack is small enough to fit in the cache it will stay hot and the fact that it isn't on the real C++ stack won't matter.

Once you've done all of that profile it both ways and see if it made any difference -- keep the faster version.

KayEss
+1  A: 

You don't need the stack, you just need a stack. You can probably use a std::stack<BoundingBoxNode*>, if I look at your code.
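For what it's worth, a std::stack can also be backed by a pre-reserved vector instead of its default deque; a sketch (make_node_stack is my name, not a standard function):

```cpp
#include <cstddef>
#include <stack>
#include <vector>

struct Node;  // hypothetical tree node

// std::stack defaults to std::deque as its container; backing it with a
// pre-reserved std::vector keeps the hot push/pop loop off the allocator.
inline std::stack<Node*, std::vector<Node*>> make_node_stack(std::size_t depth)
{
    std::vector<Node*> storage;
    storage.reserve(depth);   // one allocation up front
    return std::stack<Node*, std::vector<Node*>>(std::move(storage));
}
```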

MSalters
I can, but std::stack allocates its memory on the heap, which is expensive... I did try using std::list, and it eats 50% performance...
jopsen
Note that std::stack, which uses a deque as its underlying storage, is probably quite a bit more efficient than std::list.
anon
@jopsen: Don't believe the heap stories, nor dismiss them out of hand. You would have a problem if you had one heap allocation per traversal (O(N)). std::stack will likely need O(log(stackdepth)) allocations, and for random trees the stack depth is O(log(N)), so your total number of allocations is O(log(log(N))) - quite possibly 5.
MSalters
A: 

From your question, it appears there is a lot that still needs to be learned.

The most important thing to learn is: don't assume anything about performance without first measuring your runtime execution and analysing the results to determine exactly where the bottlenecks to performance are.

The function 'alloca' allocates a chunk of memory from the stack; the stack size is increased (by moving the stack pointer). Each call to 'alloca' creates a new chunk of memory until you run out of stack space; it does not re-use previously allocated memory. The data that was pointed to by 'stack' is lost when you allocate another chunk of memory and assign it to 'stack'. This is a memory leak. In this case, the memory is automatically freed when the function exits, so it's not a serious leak, but you've lost the data.

I would leave the "Optimization Reference Manual for IA-32" well alone. It assumes you know exactly how the CPU works. Let the compiler worry about optimisations; it will do a good enough job for what you're doing - the compiler writers hopefully know that reference inside out. With modern PCs, the common bottleneck to performance is usually memory bandwidth.

I believe the '16 deep' function calls being expensive has to do with how the CPU manages its stack, and is a guideline only. The CPU keeps the top of the stack in onboard cache; when the cache is full, the bottom of the stack is paged to RAM, which is where the performance starts to decrease. Functions with lots of arguments won't nest as deeply as functions with no arguments. And it's not just function calls - it's also local variables and memory allocated using alloca. In fact, using alloca is probably a performance hit, since the CPU will be designed to optimise its stack for common use cases - a few parameters and a few local variables. Stray from the common case and performance drops off.

Try using std::stack as MSalters has suggested above. Get that working.

Skizz
If std::stack does heap allocations, then recursive calls will be faster... std::stack uses some sort of linked list, and considering that I'm only storing pointers, I'll be allocating twice the memory I need... But maybe you're right; perhaps I should just leave it to the compiler... But it shouldn't be impossible to manipulate the stack manually...
jopsen
Looking at the code, your bottleneck is more likely to be in the intersect method. But you must profile the code to make sure. I don't think the recursive version will cause a performance problem, since pushing/popping the CPU stack is probably quicker than using dynamic storage (such as std::stack). It will also be easier to understand. Stack manipulation is limited to alloca because the language standard doesn't dictate how the CPU stack works - different CPUs implement stacks in different ways.
Skizz
Theoretically, pushing/popping from the CPU stack should be possible in a while loop... I'm well aware that the language does not offer this... But I was hoping there was some hack, assembler, or other thing that could be used to achieve this... From what I read in most of the other replies, I probably didn't formulate my problem clearly enough... :)
jopsen
A: 

Use a C++ data structure. You are using C++, after all. A std::vector<> can be pre-allocated in chunks for an amortized cost of pretty much nil. And it's safe too (as you have noticed, using the normal stack is not - especially when you're using threads).

And no, it's not expensive. It's as fast as a stack allocation.

std::list<> yes, that will be expensive. But that's because you can't pre-allocate that. std::vector<> is chunk-allocated by default.
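A small sketch of why the pre-allocation matters (the helper name is mine): after reserve(), push_back and pop_back within the reserved count are guaranteed not to reallocate, so capacity stays put through the whole loop.

```cpp
#include <cstddef>
#include <vector>

// Returns true if pushing and popping `n` elements after reserve(n)
// ever changed the vector's capacity (i.e. caused a reallocation).
inline bool reallocated_during(std::vector<int> &v, std::size_t n)
{
    v.reserve(n);                           // one allocation up front
    const std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i));   // stays within capacity
    while (!v.empty())
        v.pop_back();                       // pop never reallocates
    return v.capacity() != cap;
}
```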

Thomas