I have some code in a loop

for(int i = 0; i < n; i++)
{
  u[i] = c * u[i] + s * b[i];
}

So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?

UPDATE

I learnt vectorization (it turns out it's not so hard if you use intrinsics) and implemented my loop in SSE. However, with the SSE2 flag set in the VC++ compiler, I get about the same performance as with my own SSE code. The Intel compiler, on the other hand, was much faster than both my SSE code and the VC++ compiler.

Here is the code I wrote for reference

double *u = (double*) _aligned_malloc(n * sizeof(double), 16);
for(int i = 0; i < n; i++)
{
   u[i] = 0;
}

int j = 0;
__m128d *uSSE = (__m128d*) u;
__m128d cStore = _mm_set1_pd(c);
__m128d sStore = _mm_set1_pd(s);
for (j = 0; j <= n - 2; j += 2)
{
  __m128d uStore = _mm_set_pd(u[j+1], u[j]);
  __m128d bStore = _mm_set_pd(b[j+1], b[j]);

  __m128d cu = _mm_mul_pd(cStore, uStore);
  __m128d sb = _mm_mul_pd(sStore, bStore);

  uSSE[j/2] = _mm_add_pd(cu, sb);
}
// scalar cleanup for the last element when n is odd
for(; j < n; ++j)
{
  u[j] = c * u[j] + s * b[j];
}
A: 

It depends on how you placed u and b in memory. If both memory blocks are far from each other, SSE wouldn't give much of a boost in this scenario.

It is suggested that the arrays u and b be AoS (array of structures) instead of SoA (structure of arrays), because then you can load both of them into a register in a single instruction.

YeenFei
I disagree that using an AoS here will be advantageous over an SoA. You're still doing 2 loads for every store, and with AoS you now have to write back only 2 out of every 4 units. With SoA, you can load 4 units from `u`, 4 from `b`, and then write 4 back to `u` without needing to perform any shuffling or masking.
Adam Rosenfield
+2  A: 

Yes, this is an excellent candidate for vectorization. But, before you do so, make sure you've profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this:

int i;
for(i = 0; i < n - 3; i += 4)
{
  load elements u[i,i+1,i+2,i+3]
  load elements b[i,i+1,i+2,i+3]
  vector multiply u * c
  vector multiply s * b
  add partial results
  store back to u[i,i+1,i+2,i+3]
}

// Finish up the uneven edge cases (or skip if you know n is a multiple of 4)
for( ; i < n; i++)
  u[i] = c * u[i] + s * b[i];

For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one loop with the memory accesses from a different iteration.

Adam Rosenfield
Definitely found this code to be a bottleneck. A question to check that learning and implementing vectorization isn't a wasted effort on my part - compilers won't generally auto-vectorize such code, right?
Projectile Fish
@Projectile if you tell the compiler about aliasing, generally it will. From my own experience, it's very unusual to generate better code than the compiler without very significant effort.
aaa
+1  A: 

Probably yes, but you have to help the compiler with some hints. __restrict__ placed on the pointers tells the compiler that there is no aliasing between the two pointers. If you know the alignment of your vectors, communicate that to the compiler as well (Visual C++ may have some facility for this).

I am not familiar with Visual C++ myself, but I have heard it is not good at vectorization. Consider using the Intel compiler instead. Intel allows pretty fine-grained control over the generated assembly: http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm

aaa
Who knows Intel processors better than Intel themselves? :)
YeenFei
A: 

_mm_set_pd is not vectorized. Taken literally, it reads the two doubles using scalar operations, then combines the two scalar doubles and copies them into the SSE register. Use _mm_load_pd instead.

rwong