tags:

views:

542

answers:

9

Hi everybody

i made another post

here where I asked how to create the 26 neighbors of a cubic-voxel-node in 3-d space. I got a very good answer and implemented it.

To that I added some MIN MAX Position checking.

I would like to know if there is way, in relationship to the 3 for loops and 4 if used, to improve the execution time of this code. I have read in another post sth that when using while loops is faster but it was in a post not language specific.

Is this true? If yes, could u please help me to this in my code because i luck experience? Is there a way to implement this recursively in a way that will make it faster?

here is my code:

...
std::vector<Pos> Create26Neighbor(Pos somePos, double resol) 
{
    std::vector <Pos> vect1;
    Pos  m_MinPos(0.0,0.0,0.0);
    Pos  m_MaxPos(5.0,4.0,5.0);

    for (double dz = somePos.m_pPos[2] - resol; dz <= somePos.m_pPos[2] + resol; dz+=resol)
    {
     if (dz>m_MinPos.m_pPos[2] && dz<m_MaxPos.m_pPos[2])
     {
      for (double dy = someCPos.m_pPos[1] - resol; dy <= someCPos.m_pPos[1] + resol; dy+=resol)
      {
       if (dy>m_MinPos.m_pPos[1] && dy<m_MaxPos.m_pPos[1])
       {
        for (double dx = somePos.m_pPos[0] - resol; dx <= somePos.m_pPos[0] + resol; dx+=resol)
        {
         if (dx>m_MinPos.m_pPos[0] && dx<m_MaxPos.m_pPos[0])
         {
          // all 27
          if ((dx != somePos.m_pPos[0]) || (dy != somePos.m_pPos[1]) || (dz != somePos.m_pPos[2]))
          {
           Pos tempPos(dx,dy,dz);
           vect1.push_back(tempPos);
          }
         }
        }
       }
      }
     }
    }
    return vect1;
}
....
+2  A: 
  • From a language point of view, you can improve performance by reserving 26 (or 27 depending on what number you mean :)) items in the vector:

    std::vector<Pos> vect1; vect1.reserve(27);

    this will make the internal array large enough and avoids reallocation of the vector.

  • Whether returning the vector, or passing a vector by reference and writing to that is more performant will only be figured out by doing tests. Compilers can optimize away return value copies.

Generally though, you will get more performance boost, if you optimize the algorithm itself (or by chosing another one), than trying to optimize its implementation.

Johannes Schaub - litb
+1  A: 

All of your for loops are essentially in the form:

for (d_ = somPos._ - resol; d_ <= somPos_.+resol; d_+= resol)

This executes exactly 3 times. This code will probably get faster if you replace these three for loops with something for the form:

double dz = somePos.m_pPos[2] - resol; 
for(z = 0; z < 3; z++, dz += resol)

Using the common for loop form here will allow the optimizer to unroll these loops if it wants. I don't think the other form you had was simple enough for the optimizer to figure out it is really only going to happen 3 times. This one is.

Edit: Also if you use const or #define for your MinPos/MaxPos values the compiler may be able to speed things us a little bit. I don't think it will be able to figur out that values are really constants the way you have them.

SoapBox
There is a bug in the original version. Forever loop if the resol is zero or very very small value.
Dennis Cheung
+1  A: 

Comparing equality with floating point numbers is very risky and prone to error.

Passing and returning objects by value? Depending on your objects, this may slow things.

As far as optimization, test variables in the most "outer" loop as possible. But really, it seems that you have far larger problems than loop optimization to worry about.

HUAGHAGUAH
The vector objects are copied by push_back, there's nothing wrong with using stack allocated objects in a push_back or with returning a vector by value.
SoapBox
returning the vector by value could drop performance. but it could also be that it's not dropping performance at all. return value optimization is a fairly trivial optimization
Johannes Schaub - litb
passing the pos by value is fully correct. passing it by reference could even drop performance (assuming Pos is just 4 floats without a user defined copy constructor)
Johannes Schaub - litb
Fair enough, it's been too long since I've putzed with C++.
HUAGHAGUAH
@litb: Why does it matter whether it has a user-defined cctor? If it's user defined but can be inlined, where's the harm?
Steve Jessop
It's not inlining that's the issue. Return value optimization entirely eliminates a copy, and that's obviously only safe to do if the compiler can determine that the copy constructor didn't do anything important in the first place.
jalf
@jalf: no, as far as I remember RVO is explicitly exempt from the rules governing copy constructor invocation, i.e. the compiler is allowed to break semantics here in order to prevent an unnecessary copy. This is true regardless of the type of object!
Konrad Rudolph
Konrad is right. It is allowed to omit the call even if the call to the cctor would have side effects in *function returns*. However, in parameter passing, it can only omit it if the argument was a temporary (indeed, otherwise it would change the passed in argument too)
Johannes Schaub - litb
onebyone, yes indeed. but what if the cctor is complicated. we don't know how the cctor looks like. so i restricted to when it doesn't define a user defined copy ctor. this doesn't mean that defining it will drop performance.
Johannes Schaub - litb
+5  A: 

First, get rid of the if statements. There's no need for them. You can merge them into the loop condition. Second, avoid recomputing the loop condition every iteration. Yes, the compiler may optimize it away, but it's generally very conservative with floating-point optimizations (and it may treat fp values read from memory differently from ones read from a register, which means it can't eliminate your array lookups from the loop conditions), so it's often best to do even simple optimizations manually:

std::vector<Pos> Create26Neighbor(Pos somePos, double resol) 
{
    std::vector <Pos> vect1(27); // Initialize the vector with the correct size.
    Pos  m_MinPos(0.0,0.0,0.0);
    Pos  m_MaxPos(5.0,4.0,5.0);

    double minz = std::max(somePos.m_pPos[2] - resol, m_MinPos.m_pPos[2]);
    double maxz = std::min(somePos.m_pPos[2] + resol, m_MaxPos.m_pPos[2];
    int i = 0;
    for (double dz = min; dz <= max; dz+=resol)
    {
        double miny = std::max(somePos.m_pPos[1] - resol, m_MinPos.m_pPos[1]);
        double maxy = std::min(somePos.m_pPos[1] + resol, m_MaxPos.m_pPos[1];
        for (double dy = miny; dy <= maxy; dy+=resol)
        {
            double minx = std::max(somePos.m_pPos[0] - resol, m_MinPos.m_pPos[0]);
            double maxx = std::min(somePos.m_pPos[0] + resol, m_MaxPos.m_pPos[0];

            for (double dx = minx; dx <= maxx; dx+=resol)
            {
                ++i;
                // If we're not at the center, just use 'i' as index. Otherwise use i+1
                int idx = (dx != somePos.m_pPos[0] || dy != somePos.m_pPos[1] || dz != somePos.m_pPos[2]) ? i : i+1;
                vec1[idx] = Pos(dx, dy, dz); // Construct Pos on the spot, *might* save you a copy, compared to initilizing it, storing it as a local variable, and then copying it into the vector.
              }
        }
    }
    return vect1;
}

The last point I'd consider looking at is the inner if-statement. Branches in a tight loop can be more costly than you might expect. I can think of a number of ways to eliminate it:

  • As I sketched in the code, the ?: operator could be coaxed into calculating a different vector index for the center value (so it's written to the next vector element, and so gets overwritten again next iteration). This would eliminate the branch, but may or may not be faster overall.
  • Split up the loops so you have separate loops before and after the 'resol' value. That gets a bit awkward, with a whole lot of smaller loops, and may be less efficient overall. But it would eliminate the inner if-statement, so it may also be faster.
  • Allow the center point to be added to the vector, and either ignore it afterwards, or remove it after the loops (that'd be a somewhat expensive operation, and may or may not pay off. It may be cheaper if you use a deque instead of vector.

And make sure the compiler unrolls the inner loop. Manually unrolling it may help too.

Finally, a lot depends on how Pos is defined.

Note that most of what I suggested is qualified by "it may not be faster, but...". You have to constantly profile and benchmark every change you make, to ensure you're actually improving performance.

Depending on how far you're willing to go, you may be able to merge everything into one loop (running on integers) and just compute the Pos coordinates on the fly in each iteration.

jalf
not sure whether the calculation of idx is correct (probably I didn't understand it ;) )I guess replacing the double loops by int loops (range -1..1) and comparing int loop indexes rather than double values would speed up the code, too
devio
Oops, had the two last statements in th e? : operators swapped. Might mkae a bit more sense now. :)
jalf
thanks for the answer , but tried it and doesnt work (linking error).. :(
dieter
Well, I didn't test it. Don't just copy/paste the code, understand what changes I made and why, and then code it yourself so it works. ;)
jalf
+3  A: 

You are probably not going to find many ways of simplifying a cubic equation like that without putting in "smarts" like domain filtering.

Really, my real reason for posting here is that code is frankly beasty, ie: eugh. yuck. I have a personal and recently developed hate for over-nested code, and would endeavor to export some of those inner loops to a separate function, DESPITE the additional theoretical overhead it would add ( profile it, the small functions will usually inline anyway )

My personal opinion is if you have code that is performant, but nobody can understand it, its worse than code that is sub-optimal, but maintainable.


Also, if you can guarantee that the number of coordinates will be fixed relative to the start point, you may be able to benefit by hard-coding the structure, that is, do it by hand, ie:

function generate26( x,y,z ){ 
   return [ 
   # Top 
     # Left
      [x-1,y+1,z-1], 
      [x-1,y+1,z],
      [x-1,y+1,z+1]
   ];
}

Or generate a macro or 2 to do it for you.

At least that way you're relying up entirely on the compilers ability to optimize the structure in memory, no loops or anything. ( Profile it though, to be sure )

Kent Fredric
Yes. If it applies to the OP's situation, this generate26 function is probably the best possible optimization. No looping, no conditionals, and eminently readable. I only regret that I have but one upvote to give.
Dave Sherohman
I don't know how readable that'd get if he wants all 26 neighbors in 3d. That's a lot of hardcoded indexing. Bit in simple cases, I agree, this would be cleaner.
jalf
@jalf, you could easily create a simple construct to generate some included file I guess. Run once, and inline sort of deal.
Kent Fredric
+2  A: 

Is there a way to implement this recursively in a way that will make it faster?

No. Really, no.

Recursion means function calls, usually in large numbers. Function calls mean stack manipulation and (potentially) context changes, which are relatively slow operations.

Recursion is a powerful tool that can be used to do some very tricky things while remaining readable, but it is not a high-performance technique. In the best case, you might find a compiler which optimizes tail recursion to run as fast as a normal loop - and this is accomplished by converting the recursive code into a normal loop behind the scenes.

Dave Sherohman
+1  A: 

So basically, in the normal case, you want to add 26 positions to the vector, and those could be easily enumerated, except that you have to be carefull not to access voxels that are out-of bounds.

If you really, really, want to optimize this function to the max, the most optimal implementation would be a single switch, and unrolled loops.

For each of the 3 dimensions, there are only five possibilites:

case 1:  {somePos[i] - resol};  // 1 value only
case 2:  {somePos[i] - resol, somePos[i]}   // 2 values
case 3:  {somePos[i] - resol, somePos[i], somePos[i] + resol}  // all 3
case 4:                      {somePos[i], somePos[i] + resol}  // 2 values again
case 5:                                  {somePos[i] + resol}  // 1 value only

There's also a "case 0" where none of the values are in range. But if that's true for any of the dimensions, then you don't add any values at all.

Combining the 5 posssiblities for each of three dimensions, gives you 125 possible cases to implement. Given which of the 125 cases you've got, you can then just unroll the loops and ifs into a sequence of up to 26 push_back() calls.

Something like this:

enum eCase {
CASE_NONE = 0,
CASE_LOW1 = 1,
CASE_LOW2 = 2,
CASE_ALL3 = 3,
CASE_HIGH2 = 4,
CASE_HIGH1 = 5,
};

eCase Xcase = /* a function of somePos[0], m_MinPos[0], m_MaxPos[0], and resol */
eCase Ycase = ...
eCase Zcase = ...

#define MUNGE(_x,_y,_z) (((((_x)*6)+(_y))*6)+(_z))
switch (MUNGE(Xcase, Ycase, Zcase) {

default:
    break; // all CASE_NONE's do nothing
case MUNGE (CASE_ALL3, CASE_ALL3, CASE_ALL3):
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1] - resol, somePos.m_pPos[2] - resol));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1] - resol, somePos.m_pPos[2]        ));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1] - resol, somePos.m_pPos[2] + resol));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1]        , somePos.m_pPos[2] - resol));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1]        , somePos.m_pPos[2]        ));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1]        , somePos.m_pPos[2] + resol));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1] + resol, somePos.m_pPos[2] - resol));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1] + resol, somePos.m_pPos[2]        ));
    vect1.push_back( pos (somePos.m_pPos[0] - resol, somePos.m_pPos[1] + resol, somePos.m_pPos[2] + resol));

    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1] - resol, somePos.m_pPos[2] - resol));
    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1] - resol, somePos.m_pPos[2]        ));
    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1] - resol, somePos.m_pPos[2] + resol));
    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1]        , somePos.m_pPos[2] - resol));
    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1]        , somePos.m_pPos[2] + resol));
    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1] + resol, somePos.m_pPos[2] - resol));
    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1] + resol, somePos.m_pPos[2]        ));
    vect1.push_back( pos (somePos.m_pPos[0]        , somePos.m_pPos[1] + resol, somePos.m_pPos[2] + resol));


vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1] - resol, somePos.m_pPos[2] - resol));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1] - resol, somePos.m_pPos[2]        ));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1] - resol, somePos.m_pPos[2] + resol));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1]        , somePos.m_pPos[2] - resol));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1]        , somePos.m_pPos[2]        ));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1]        , somePos.m_pPos[2] + resol));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1] + resol, somePos.m_pPos[2] - resol));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1] + resol, somePos.m_pPos[2]        ));
vect1.push_back( pos (somePos.m_pPos[0] + resol, somePos.m_pPos[1] + resol, somePos.m_pPos[2] + resol));
break;

... only 124 more cases to go!

DO NOT -- repeat DO NOT actually write all of this code by hand!!! There is no way any human could do that without coding a hard-to-find bug. Write another program to write source code instead. :-)

Die in Sente
My bad: there are Six interesting cases in each dimension, not just five. you can have the case of middle-only. That makes a total of 216 cases to implement!
Die in Sente
A: 
std::vector<Pos> Create26Neighbor(Pos somePos, double resol) 
{
    std::vector<Pos> vect1(26);
    Pos  m_MinPos(0.0,0.0,0.0);
    Pos  m_MaxPos(5.0,4.0,5.0);

    double z = somePos.m_pPos[2] - resol;

    for(int dz = -1; dz <= 1; ++dz) {
        z += resol;
        if(z <= m_MinPos.m_pPos[2] || z >= m_MaxPos.m_pPos[2])
            continue;

        double y = somePos.m_pPos[1] - resol;

        for(int dy = -1; dy <= 1; ++dy) {
            y += resol;
            if(y <= m_MinPos.m_pPos[1] || y >= m_MaxPos.m_pPos[1])
                continue;

            double x = somePos.m_pPos[0] - resol;

            for(int dx = -1; dx <= 1; ++dx) {
                x += resol;

                if(dx == 0 && dy == 0 && dz == 0)
                    continue;

                if(x <= m_MinPos.m_pPos[0] || x >= m_MaxPos.m_pPos[0])
                    continue;

                vect1.push_back(Pos(x, y, z));
            }
        }
    }

    return vect1;
}

I tried to optimize this for readability. Do you really care about speed? I wouldn't think speed is that important for creating some neighbor nodes. Have you profiled your code to see if this is the bottleneck?

strager
A: 

I haven't tried to figure it out, but you may be able to do some nifty things with SSE2/Altivec/other vector instructions to do multiple compares at once.

Zan Lynx