views:

147

answers:

2

Suppose you want to compute the sum of the squared differences of consecutive items:

$\sum_{i=1}^{N-1} (x_i - x_{i+1})^2$

The simplest code (the input is std::vector<double> xs, the output is sum2) is:

double sum2 = 0.;
double prev = xs[0];
for (std::vector<double>::const_iterator i = xs.begin() + 1;
     i != xs.end(); ++i)
{
    sum2 += (prev - *i) * (prev - *i); // only one subtraction with compiler optimization
    prev = *i;
}

I hope the compiler performs the optimization noted in the comment above. If N is the length of xs, this costs N-1 multiplications and 2N-3 additions (counting both + and -).

Now suppose you already know this quantity:

$x_1^2 + x_N^2 + 2 \sum_{i=2}^{N-1} x_i^2$

and call it sum. Expanding the binomial square:

$\sum_{i=1}^{N-1} (x_i - x_{i+1})^2 = sum - 2\sum_{i=1}^{N-1} x_i x_{i+1}$
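To convince yourself the identity holds, a quick throwaway check (not part of the question; direct and expanded are hypothetical names) can compare both sides on sample data:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Direct evaluation of sum_{i=1}^{N-1} (x_i - x_{i+1})^2.
double direct(const std::vector<double>& xs)
{
    double s = 0.;
    for (std::size_t i = 0; i + 1 < xs.size(); ++i) {
        double d = xs[i] - xs[i + 1];
        s += d * d;
    }
    return s;
}

// Evaluation via the expanded form:
// sum = x_1^2 + x_N^2 + 2 * sum_{i=2}^{N-1} x_i^2,
// result = sum - 2 * sum_{i=1}^{N-1} x_i x_{i+1}.
double expanded(const std::vector<double>& xs)
{
    const std::size_t n = xs.size();
    double sum = xs[0] * xs[0] + xs[n - 1] * xs[n - 1];
    for (std::size_t i = 1; i + 1 < n; ++i)
        sum += 2. * xs[i] * xs[i];

    double cross = 0.;
    for (std::size_t i = 0; i + 1 < n; ++i)
        cross += xs[i] * xs[i + 1];

    return sum - 2. * cross;
}
```

For xs = {1, 2, 4, 7} both forms give 1 + 4 + 9 = 14.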

so the code becomes:

double sum2 = 0.;
double prev = xs[0];
for (std::vector<double>::const_iterator i = xs.begin() + 1;
     i != xs.end(); ++i)
{
    sum2 += *i * prev;
    prev = *i;
}
sum2 = -2. * sum2 + sum;

Here I have N multiplications and N-1 additions. In my case N is about 100.

Well, compiling with g++ -O2 I get no speed-up (I tried calling the inlined function 2M times). Why?

+2  A: 

The multiplications are much more costly than the additions in terms of execution time. Also, depending on the processor, additions and multiplications can be done in parallel, i.e. it will start the next multiplication while it is still doing the addition (see http://en.wikipedia.org/wiki/Out-of-order_execution).

So reducing the number of additions will not help much for the performance.

What you can do is make it easier for the compiler to vectorize your code, or vectorize by yourself. To make it easier for the compiler to vectorize, I would use a regular array of doubles, use subscripts and not pointers.

EDIT: N = 100 might also be too small a number to see a difference in execution time. Try a bigger N.

Dirty code, but it shows the perf improvement. Output:

1e+06
59031558
1e+06
18710703

The speedup you get is ~3x.

#include <vector>
#include <iostream>

using namespace std;

unsigned long long int rdtsc(void)
{
  unsigned a, d;

  __asm__ volatile("rdtsc" : "=a" (a), "=d" (d));

  return ((unsigned long long)a) | (((unsigned long long)d) << 32);
}



double f(std::vector<double>& xs)
{
  double sum2 = 0.;
  double prev = xs[0];

  vector<double>::const_iterator iend = xs.end();
  for (vector<double>::const_iterator i = xs.begin() + 1;
       i != iend; ++i)
    {
      sum2 += (prev - *i) * (prev - *i); // only one subtraction with compiler optimization
      prev = (*i);
    }

  return sum2;
}

double f2(double *xs, int N)
{
  double sum2 = 0;

  for(int i = 0; i < N - 1; i+=1) {
    sum2 += (xs[i+1] - xs[i])*(xs[i+1] - xs[i]);

  }

  return sum2;
}

int main(int argc, char* argv[])
{
  int N = 1000001;
  std::vector<double> xs;
  for(int i=0; i<N; i++) {
    xs.push_back(i);
  }

  unsigned long long int a, b;
  a = rdtsc();
  std::cout << f(xs) << endl;
  b = rdtsc();
  cout << b - a << endl;

  a = rdtsc();
  std::cout << f2(&xs[0], N) << endl;
  b = rdtsc();
  cout << b - a << endl;
}
Kamchatka
N is fixed by the problem
wiso
Right. But if you can, it's easier to have N bigger to check speedups.
Kamchatka
a) Why do you use i+=1 instead of ++i? b) Is the speed-up because you are using a plain array instead of std::vector, or because you are not using the `prev` variable?
wiso
a) Because I tried loop unrolling with i+=4 and it didn't give much more speed-up. b) The speed-up is mostly because I use a plain array (you can try by removing the prev in your example). Also, compilers do a better job if you use standard constructs (though it's not a general rule). If possible, you should try ICC (Intel C Compiler). There is a free version for personal use and it is really good at loop analysis.
Kamchatka
with g++ -O2 I don't get your improvement: 1e+06->7422819, 1e+06->7186527
wiso
Interesting. I used g++ without flags, which is generally g++ -O2. What architecture are you running on?
Kamchatka
linux x86_64, gcc 4. I think -O2 is not the default. From man gcc "-O0 Reduce compilation time and make debugging produce the expected results. This is the default."
wiso
+1  A: 

Addition can be free when done as x += a*b (a fused multiply-add). The compiler should be able to figure that out in the first version, if the architecture supports it.
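A minimal sketch of what this means in practice (sum_sq_diff_fma is a hypothetical name, not from the question): the loop body sum2 += d * d is exactly a multiply followed by an add, and std::fma makes the fusion explicit.

```cpp
#include <cassert>
#include <cmath>

// Same computation as the question's first loop, but with the
// multiply-add written as an explicit fused multiply-add.
// On hardware with FMA support this can be a single instruction
// (and a single rounding).
double sum_sq_diff_fma(const double* xs, int n)
{
    double sum2 = 0.;
    for (int i = 0; i + 1 < n; ++i) {
        double d = xs[i] - xs[i + 1];
        sum2 = std::fma(d, d, sum2); // sum2 += d * d, fused
    }
    return sum2;
}
```

Whether this is actually faster depends on the target; with -O2 -march=native the compiler will often emit the fused form on its own.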

The arithmetic is probably happening in parallel with the dereference of *i, which could otherwise be the slower part.

Do not call xs.end() on every loop iteration unless you expect its return value to change. If the compiler cannot optimize it out, it will dwarf the rest of the loop.
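The hoisting this suggests looks like the following sketch (sum_all is a hypothetical name used only for illustration):

```cpp
#include <cassert>
#include <vector>

// Hoist the end iterator out of the loop so xs.end() is evaluated
// exactly once, instead of once per iteration.
double sum_all(const std::vector<double>& xs)
{
    double s = 0.;
    const std::vector<double>::const_iterator iend = xs.end(); // evaluated once
    for (std::vector<double>::const_iterator i = xs.begin(); i != iend; ++i)
        s += *i;
    return s;
}
```

This is the same pattern the earlier answer's f() uses with its iend variable.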

drawnonward