ansaurus

Question

Answer 1

A:

It depends on how `u` and `b` are laid out in memory. If the two memory blocks are far apart, SSE won't give much of a boost in this scenario.

It is suggested that `u` and `b` be stored as an AOS (array of structures) rather than an SOA (structure of arrays), because you can then load both of them into a register with a single instruction.

YeenFei 2010-05-27 03:52:42

I disagree that using an AOS here will be advantageous over an SOA. You're still doing 2 loads for every store, and with AOS you now have to write back only 2 out of every 4 units. With SOA, you can load 4 units from `u`, 4 from `b`, and then write 4 back to `u` without needing to perform any shuffling or masking.
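To make the layout argument concrete, here is a minimal sketch in C. The type and function names (`ub_pair`, `ub_arrays`, `load_aos_lanes`, `load_soa_lanes`) are hypothetical and `float` elements are assumed; the point is only to show what a single 128-bit load puts in a register under each layout.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_storeu_ps */

/* Array of structures: u and b are interleaved in memory. */
struct ub_pair { float u, b; };

/* Structure of arrays: each field is contiguous. */
struct ub_arrays { float *u; float *b; size_t n; };

/* Copies out what one 128-bit load from the start of an AoS array yields:
 * lanes are u0, b0, u1, b1, so only two of the four lanes hold u values.
 * The caller must supply at least two pairs. */
void load_aos_lanes(const struct ub_pair *pairs, float lanes[4])
{
    assert(sizeof(struct ub_pair) == 2 * sizeof(float)); /* no padding */
    _mm_storeu_ps(lanes, _mm_loadu_ps((const float *)pairs));
}

/* The same load on the SoA field u yields four consecutive u values. */
void load_soa_lanes(const struct ub_arrays *a, float lanes[4])
{
    assert(a->n >= 4);
    _mm_storeu_ps(lanes, _mm_loadu_ps(a->u));
}
```

With AoS, the result of the arithmetic would have to be shuffled or masked so that only the `u` lanes are written back; with SoA, all four lanes can be stored directly.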

Adam Rosenfield 2010-05-27 04:00:03

Answer 2

+2 A:

Yes, this is an excellent candidate for vectorization. But, before you do so, make sure you've profiled your code to be sure that this is actually worth optimizing. That said, the vectorization would go something like this:

int i;
for(i = 0; i < n - 3; i += 4)
{
  load elements u[i,i+1,i+2,i+3]
  load elements b[i,i+1,i+2,i+3]
  vector multiply u * c
  vector multiply s * b
  add partial results
  store back to u[i,i+1,i+2,i+3]
}

// Finish up the uneven edge cases (or skip if you know n is a multiple of 4)
for( ; i < n; i++)
  u[i] = c * u[i] + s * b[i];
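As a rough sketch, the pseudocode above could be written with SSE intrinsics as follows. This assumes `float` elements (four per 128-bit register) and that `u` and `b` do not overlap; the function name `axpby_sse` is made up for illustration. Unaligned loads and stores are used so the caller does not need to guarantee alignment; if the arrays are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could be used instead.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_*_ps intrinsics */

/* Computes u[i] = c * u[i] + s * b[i] for n floats, four at a time.
 * Assumption: u and b do not overlap. */
void axpby_sse(float *u, const float *b, float c, float s, size_t n)
{
    assert(u != NULL && b != NULL);

    const __m128 vc = _mm_set1_ps(c);  /* broadcast c to all four lanes */
    const __m128 vs = _mm_set1_ps(s);  /* broadcast s to all four lanes */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 vu = _mm_loadu_ps(u + i);            /* u[i..i+3] */
        __m128 vb = _mm_loadu_ps(b + i);            /* b[i..i+3] */
        __m128 r  = _mm_add_ps(_mm_mul_ps(vc, vu),  /* c * u     */
                               _mm_mul_ps(vs, vb)); /* + s * b   */
        _mm_storeu_ps(u + i, r);                    /* back to u */
    }

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        u[i] = c * u[i] + s * b[i];
}
```

For `double` data the same structure applies with two elements per register and the `_pd` variants of the intrinsics from `<emmintrin.h>`.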

For even more performance, you can consider prefetching further array elements, and/or unrolling the loop and using software pipelining to interleave the computation in one loop with the memory accesses from a different iteration.

Adam Rosenfield 2010-05-27 04:06:15

Definitely found this code to be a bottleneck. One question, to check that learning and implementing vectorization isn't wasted effort: compilers generally won't vectorize such code automatically, right?

Projectile Fish 2010-05-27 04:17:23

@Projectile If you tell the compiler about aliasing, it generally will. In my own experience, it is very unusual to generate better code than the compiler without very significant effort.

aaa 2010-05-27 04:26:48

Answer 3

+1 A:

Probably yes, but you have to help the compiler with some hints. `__restrict__` placed on pointers tells the compiler that the two pointers do not alias. If you know the alignment of your vectors, communicate that to the compiler (Visual C++ may have some facility for this).

I am not familiar with Visual C++ myself, but I have heard it is not good at vectorization. Consider using the Intel compiler instead. Intel allows pretty fine-grained control over the generated assembly: http://www.intel.com/software/products/compilers/docs/clin/main_cls/cref_cls/common/cppref_pragma_vector.htm
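A minimal sketch of the aliasing hint, assuming `float` data and a hypothetical function name `axpby_scalar`. It uses the standard C99 keyword `restrict`; `__restrict__` is the GCC spelling and `__restrict` is the one Visual C++ accepts. Whether the loop actually gets vectorized depends on the compiler and flags (with GCC, `-O3` enables the auto-vectorizer), so check the generated assembly rather than taking it on faith.

```c
#include <assert.h>
#include <stddef.h>

/* Plain scalar loop. The restrict qualifiers promise the compiler that u and
 * b never overlap, which removes one obstacle to auto-vectorization.
 * Passing overlapping arrays breaks that promise and is undefined behavior. */
void axpby_scalar(float *restrict u, const float *restrict b,
                  float c, float s, size_t n)
{
    /* Cheap sanity check: catches only the exact-same-pointer case. */
    assert(n == 0 || u != b);

    for (size_t i = 0; i < n; i++)
        u[i] = c * u[i] + s * b[i];
}
```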

aaa 2010-05-27 04:07:23

Who knows Intel processors better than Intel? :)

YeenFei 2010-05-27 05:05:03

Answer 4

A:

`_mm_set_pd` is not vectorized. Taken literally, it reads the two doubles using scalar operations, then combines the two scalar doubles and copies them into the SSE register. Use `_mm_load_pd` instead.
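A sketch of what that substitution might look like for `double` data, under the assumption that the loop computes `u[i] = c * u[i] + s * b[i]`; the function name `axpby_sse2` is hypothetical. Note that `_mm_load_pd` and `_mm_store_pd` require 16-byte-aligned addresses; `_mm_loadu_pd` and `_mm_storeu_pd` are the unaligned counterparts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: __m128d and the _mm_*_pd intrinsics */

/* Computes u[i] = c * u[i] + s * b[i] for n doubles, two per register.
 * Assumptions: u and b are 16-byte aligned and do not overlap. */
void axpby_sse2(double *u, const double *b, double c, double s, size_t n)
{
    assert(((uintptr_t)u & 15) == 0 && ((uintptr_t)b & 15) == 0);

    const __m128d vc = _mm_set1_pd(c);  /* broadcast c to both lanes */
    const __m128d vs = _mm_set1_pd(s);  /* broadcast s to both lanes */
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        /* One packed load each, instead of _mm_set_pd(u[i + 1], u[i]). */
        __m128d vu = _mm_load_pd(u + i);
        __m128d vb = _mm_load_pd(b + i);
        __m128d r  = _mm_add_pd(_mm_mul_pd(vc, vu), _mm_mul_pd(vs, vb));
        _mm_store_pd(u + i, r);
    }

    /* Scalar tail when n is odd. */
    for (; i < n; i++)
        u[i] = c * u[i] + s * b[i];
}
```

Using `_mm_set1_pd` for the loop-invariant constants `c` and `s` is fine, since the broadcast happens once outside the loop.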

rwong 2010-06-30 15:19:23


SSE SIMD Optimization For Loop
