I have the following code in the innermost loop of my program:

struct V {
  float val [200]; // 0 <= val[i] <= 1
};

V a[600];
V b[250];
V c[250];
V d[350];
V e[350];

// ... init values in a,b,c,d,e ...

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ii++) {
    float act_val =
      a[ai].val[ii] +
      b[bi].val[ii] +
      c[ci].val[ii] +
      d[ci].val[ii] +
      e[ci].val[ii];

    if (act_val > best_val) {
      best_val = act_val;
      best_ii = ii;
    }
  }

  return best_ii;
}

I don't care whether the solution is a clever algorithm (that would be most interesting), C++ tricks, intrinsics, or assembler; I just need to make the findmax function more efficient.

Big thanks in advance.

Edit: It seems that the branch is the slowest operation (misprediction?).

A: 

You can't really get that much faster than that without additional information about the data (values) stored in a, b, c, d, and e. You have to inspect every sum to determine which one is the greatest.

It gets a little worse for Nth-element queries, but fortunately, you didn't ask that one.

MSN
+1  A: 

Try to iterate all vectors at once. Here's the example for two vectors:

for (float *ap = a[ai].val, *bp = b[bi].val; ap != a[ai].val + 200; ++ap, ++bp) {
    float act_val = *ap + *bp;
    // check for max and record the index if necessary
}
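Applied to all five vectors from the question, the same pointer-walk idea might look like this sketch (struct V and the arrays are as declared in the question; findmax_ptrs is a hypothetical name):

```cpp
// Sketch: one walk over all five rows at once, so each row's base
// address is computed only once, before the loop.
struct V {
  float val[200]; // 0 <= val[i] <= 1
};

V a[600];
V b[250];
V c[250];
V d[350];
V e[350];

int findmax_ptrs(int ai, int bi, int ci, int di, int ei) {
  const float *ap = a[ai].val, *bp = b[bi].val, *cp = c[ci].val,
              *dp = d[di].val, *ep = e[ei].val;
  float best_val = 0.0f;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ++ii) {
    float act_val = ap[ii] + bp[ii] + cp[ii] + dp[ii] + ep[ii];
    if (act_val > best_val) {
      best_val = act_val;
      best_ii = ii;
    }
  }
  return best_ii;
}
```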
Pavel Shved
+1  A: 

I don't see any way to do this without examining each sum, making this an O(n) problem. But since your data are laid out linearly, the Intel/AMD MMX or SSE instructions might help. See this link for Microsoft's implementation of intrinsics:

http://msdn.microsoft.com/en-us/library/y0dh78ez(VS.71).aspx

Mark Ransom
To be specific, you want the addps (packed addition) instruction, which will in effect do 4 float additions simultaneously, dumping the result into an XMM register which amounts to a float[4]. If you store a few of these, you may then also get a gain using maxps (packed max) to do parallel comparisons. Obviously the last few comparisons have to be done with single-float operations rather than SSE.
Steve Jessop
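A sketch of the addps/maxps idea with SSE intrinsics (not a drop-in replacement; findmax_sse is a hypothetical name, and struct V and the arrays are assumed from the question). Since 200 is divisible by 4 there is no scalar remainder loop; indices 0..199 are tracked as floats, which is exact in that range, and only the final 4-lane reduction is done with single-float operations as noted above:

```cpp
#include <xmmintrin.h>  // SSE: _mm_add_ps, _mm_max_ps, _mm_cmpgt_ps, ...

struct V {
  float val[200]; // 0 <= val[i] <= 1
};

V a[600];
V b[250];
V c[250];
V d[350];
V e[350];

int findmax_sse(int ai, int bi, int ci, int di, int ei) {
  const float *pa = a[ai].val, *pb = b[bi].val, *pc = c[ci].val,
              *pd = d[di].val, *pe = e[ei].val;

  __m128 best = _mm_set1_ps(0.0f);   // per-lane best sums
  __m128 bidx = _mm_set1_ps(-1.0f);  // per-lane best indices (as floats)
  __m128 idx  = _mm_setr_ps(0.f, 1.f, 2.f, 3.f);
  const __m128 four = _mm_set1_ps(4.0f);

  for (int ii = 0; ii < 200; ii += 4) {
    // five packed additions (addps)
    __m128 s = _mm_add_ps(_mm_loadu_ps(pa + ii), _mm_loadu_ps(pb + ii));
    s = _mm_add_ps(s, _mm_loadu_ps(pc + ii));
    s = _mm_add_ps(s, _mm_loadu_ps(pd + ii));
    s = _mm_add_ps(s, _mm_loadu_ps(pe + ii));

    __m128 gt = _mm_cmpgt_ps(s, best);  // lanes where s is a new max
    best = _mm_max_ps(best, s);         // the packed max (maxps) step
    // blend: take the new index where gt, keep the old index elsewhere
    bidx = _mm_or_ps(_mm_and_ps(gt, idx), _mm_andnot_ps(gt, bidx));
    idx  = _mm_add_ps(idx, four);
  }

  // the last few comparisons are single-float, as noted above
  float vals[4], idxs[4];
  _mm_storeu_ps(vals, best);
  _mm_storeu_ps(idxs, bidx);

  float best_val = vals[0];
  int   best_ii  = (int)idxs[0];
  for (int l = 1; l < 4; ++l) {
    // on equal sums, prefer the smaller index to match the scalar version
    if (vals[l] > best_val ||
        (vals[l] == best_val && (int)idxs[l] < best_ii)) {
      best_val = vals[l];
      best_ii  = (int)idxs[l];
    }
  }
  return best_ii;
}
```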
+1  A: 

Well, I see no obvious room for algorithmic optimizations. Theoretically one could calculate the sum of the five vectors only until it is obvious that the maximum cannot be reached, but this would add way too much overhead for summing only five numbers. You could try using multiple threads and assign ranges to the threads, but you have to consider the thread creation overhead when you have only 200 very short work items.

So I tend to say that using assembler and MMX or SSE instructions on x86, or maybe a (machine-specific) C++ library providing access to these instructions, is your best bet.

Daniel Brückner
"you have only 200 very short work items." Although he says the code is in a most inner loop, so if he's doing it for a lot of different combinations of ai, bi, etc, then perhaps he could multithread and break the work up at a level higher than this function. Depends whether the vector contents and each set of 5 params depends on the results of previous calculations, or not. Also it's not thread creation overhead so much as thread communication overhead, since you could maintain a pool of worker threads rather than creating them each call.
Steve Jessop
If you're bringing threading in to the equation, you also have to consider whether this will really help, which depends on the greater purpose of the app and where it will be running.
krdluzni
Having said that, multi-threading will never make this algorithm "more efficient", just potentially faster. It will not end up taking fewer CPU cycles/ops to calculate the result. Multi-threading normally only helps if there are idle cores on the machine, which on e.g. a server running lots of apps, there might very well not be.
Steve Jessop
Ah, krdluzni beat me to it.
Steve Jessop
+2  A: 

Unless compiler optimizes them out for you, computing a[ai], etc., in the loop will cost you some time (however slight) given that they are fixed for the duration of findmax. In light of that you might try something like:

#include <limits>

int findmax(int ai, int bi, int ci, int di, int ei) {
    float    best_val = std::numeric_limits<float>::min();
    int      best_ii = 0;
    // renamed so the references don't shadow (and self-reference) the arrays
    const V& va(a[ai]);
    const V& vb(b[bi]);
    const V& vc(c[ci]);
    const V& vd(d[di]);
    const V& ve(e[ei]);

    for (int ii = 0; ii < 200; ++ii) {
        float act_val = va.val[ii] + vb.val[ii] + vc.val[ii] +
                        vd.val[ii] + ve.val[ii];

        if (act_val > best_val) {
            best_val = act_val;
            best_ii = ii;
        }
    }

    return best_ii;
}

Other means of improving the code might be to alter the way the data is represented, leading to a different (but much faster) findmax algorithm.

fbrereto
Agreed, there isn't much room for optimization within the function. But perhaps you find the same max multiple times, or the data is laid out in such a way that you can find shortcuts; those are the things that should speed up the overall code.
DeusAduro
Any reasonable compiler would perform this optimization for you automatically.
Mark Ransom
best_val should be initialized to negative infinity
Jason S
+4  A: 

This might help a bit if the compiler is having difficulty short-cutting the jumps:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  float* a_it = &a[ai].val[0];
  float* b_it = &b[bi].val[0];
  float* c_it = &c[ci].val[0];
  float* d_it = &d[di].val[0]; // assume typo ci->di
  float* e_it = &e[ei].val[0]; // assume typo ci->ei

  for (int ii = 0; ii < 200; ii++) {
    float act_val = *(a_it++) + *(b_it++) + *(c_it++) + *(d_it++) + *(e_it++);
    // select best_ii against the *old* best_val, then update best_val
    best_ii  = (act_val <= best_val) ? best_ii : ii;        // becomes _fsel
    best_val = (act_val <= best_val) ? best_val : act_val;  // becomes _fsel
  }
  }

  return best_ii;
}

Generating a sum table might be faster in terms of cache misses; here's a version:

int findmax(int ai, int bi, int ci, int di, int ei) {
  float best_val = 0.0;
  int best_ii = -1;

  float* its[] = { &a[ai].val[0], &b[bi].val[0], &c[ci].val[0],
                   &d[di].val[0], &e[ei].val[0] };

  V sums;
  for (int ii = 0; ii < 200; ii++) {
    sums.val[ii] = *(its[0]++);
  }

  for (int iter = 1; iter < 5; ++iter) {
    for (int ii = 0; ii < 200; ii++) {
      sums.val[ii] += *(its[iter]++);
    }
  }

  for (int ii = 0; ii < 200; ii++) {
    // select best_ii against the *old* best_val, then update best_val
    best_ii  = (sums.val[ii] <= best_val) ? best_ii : ii;            // becomes _fsel
    best_val = (sums.val[ii] <= best_val) ? best_val : sums.val[ii]; // becomes _fsel
  }
  return best_ii;
}
Charles Beattie
If you don't fancy my method, try the _fsel method of setting best_val and best_ii.
Charles Beattie
+1  A: 

Take a look at loop unwinding (and Duff's device for a specific, but far more complicated, example). Those are the only real algorithm optimizations I can come up with.

Loop_unwinding

Duff's_device
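For the question's fixed trip count of 200, a plain unroll-by-4 needs no Duff-style remainder handling, since 200 % 4 == 0. A sketch (struct V and the arrays are assumed from the question; findmax_unrolled is a hypothetical name):

```cpp
// Sketch: the question's loop unrolled by a factor of 4.
struct V {
  float val[200]; // 0 <= val[i] <= 1
};

V a[600];
V b[250];
V c[250];
V d[350];
V e[350];

int findmax_unrolled(int ai, int bi, int ci, int di, int ei) {
  const float *ap = a[ai].val, *bp = b[bi].val, *cp = c[ci].val,
              *dp = d[di].val, *ep = e[ei].val;
  float best_val = 0.0f;
  int best_ii = -1;

  for (int ii = 0; ii < 200; ii += 4) {  // four sums per iteration
    float s0 = ap[ii]     + bp[ii]     + cp[ii]     + dp[ii]     + ep[ii];
    float s1 = ap[ii + 1] + bp[ii + 1] + cp[ii + 1] + dp[ii + 1] + ep[ii + 1];
    float s2 = ap[ii + 2] + bp[ii + 2] + cp[ii + 2] + dp[ii + 2] + ep[ii + 2];
    float s3 = ap[ii + 3] + bp[ii + 3] + cp[ii + 3] + dp[ii + 3] + ep[ii + 3];
    if (s0 > best_val) { best_val = s0; best_ii = ii;     }
    if (s1 > best_val) { best_val = s1; best_ii = ii + 1; }
    if (s2 > best_val) { best_val = s2; best_ii = ii + 2; }
    if (s3 > best_val) { best_val = s3; best_ii = ii + 3; }
  }
  return best_ii;
}
```

Whether this beats the plain loop depends on the compiler; many will unroll automatically at -O2/-O3.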

krdluzni
You don't actually need Duff's device when the loop is always the same length (in this case 200). Either use a factor of 200 as the length to unroll, or else use a non-factor but start with a single goto into the middle of the loop.
Steve Jessop
You're right, you don't, but I thought it would serve as an interesting example of unwinding. In all honesty, though, Duff's device has a lot more going on than a regular unwind, and I'm thinking about removing it from my post.
krdluzni
I'm totally in favour of everyone *seeing* Duff's device, though, just so long as they know not to *use* it unless absolutely necessary. And maybe not even then :-)
Steve Jessop