What intrinsics would I use to vectorize the following (if it's even possible to vectorize) on x86_64?

double myNum = 0;
for(int i=0;i<n;i++){
    myNum += a[b[i]] * c[i]; //b[i] = int, a[b[i]] = double, c[i] = double
}
A: 

Short answer: no. Long answer: yes, but not efficiently. You will incur the penalty for doing non-aligned loads, which will negate any kind of benefit. Unless you can guarantee that successive b[i] indices are aligned, you will most likely have worse performance after vectorization.

If you know beforehand what the indices are, your best bet is to unroll and specify explicit indices. I did something similar using template specialization and code generation; if you're interested, I can share.
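For illustration, if the index pattern were a fixed, compile-time-known stencil, the unrolled version could look something like the sketch below (the IDX values and function name are made up, not the template/code-generation version mentioned above):

/* Hypothetical fixed 4-element index pattern, known at compile time. */
static const int IDX[4] = {0, 3, 5, 6};

static double dot_fixed(const double *a, const double *c)
{
    /* With constant indices the gather reduces to constant offsets,
       which the compiler can schedule and vectorize freely. */
    return a[IDX[0]] * c[0]
         + a[IDX[1]] * c[1]
         + a[IDX[2]] * c[2]
         + a[IDX[3]] * c[3];
}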

To answer your comment: you basically have to concentrate on the a array. The easiest thing to try right away is to block your loop by a factor of two, load the two a elements into the low and high halves of an SSE register, and then use the _mm_*_pd intrinsics as usual. Pseudocode:

#include <emmintrin.h>

__m128d va, vc, result = _mm_setzero_pd();
for(int i = 0; i < n; i += 2) {
  // gather the two non-contiguous doubles from A via the indices in B
  va = _mm_setr_pd(A[B[i]], A[B[i+1]]);
  // you may also load B using a packed integer instruction
  vc = _mm_loadu_pd(&C[i]);   // two adjacent doubles from C
  result = _mm_add_pd(result, _mm_mul_pd(va, vc));
}

I do not remember the function names exactly, so you may want to double-check. Also, use the restrict keyword on the pointers if you know there can be no aliasing issues; this will allow the compiler to be much more aggressive.

aaa
Can you explain how I'd go about vectorizing this (even with the non-aligned penalty)? I'd be curious to benchmark the performance myself.
Mike
This is not going to work, because of the double indirection of the indices.
Paul R
I don't think restrict will have any benefit here, since all the writes are to a local variable. If they were computing d[i] = a[b[i]] * c[i], then restrict on d would help.
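For example, a minimal sketch of that case (the function name and signature here are made up), where restrict on d lets the compiler assume the stores never alias the other arrays:

/* Hypothetical signature for the d[i] = a[b[i]] * c[i] case;
   restrict on d promises the stores never alias a, b or c. */
void gather_mul(double *restrict d, const double *restrict a,
                const int *restrict b, const double *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        d[i] = a[b[i]] * c[i];
}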
celion
A: 

This is not going to vectorize as it is, because of the double indirection of the array indices. Since you're working with doubles there is little or nothing to be gained from SSE, particularly as most modern CPUs have 2 FPUs anyway.

Paul R
Wrong, SSE2 allows you to work with two 64-bit doubles simultaneously in one 128-bit SSE register.
LiraNuna
@Liranuna - how does processing two doubles in one SSE register give you an advantage over using two FPUs? Indeed, the extra overhead of packing two non-contiguous doubles into an SSE register will almost certainly make an SSE solution slower than a scalar implementation.
Paul R
@Paul: SSE isn't a magic optimization shroud. If you misuse it, you will find it slower than naive code. Proper usage of SSE, however, will always give you a speed boost of at least 30%.
LiraNuna
@LiraNuna - I know - I was advocating against SSE in this case.
Paul R
@Paul: why not further utilize features that are already in use? The x86_64 scalar FPU path already uses `mulsd` and `addsd`, so why not use the packed forms of the same instructions (which cost about the same number of cycles) to do double the work? See my answer for how to make the best use of SSE(2) in this case.
LiraNuna
@Liranuna - you may see a small benefit, but I doubt that it's anywhere near 2x for this case, where the number of arithmetic operations is small, and you have a lot of loads per iteration. You also have a misaligned load in there, which is expensive on anything but a Core i7. It would be worth benchmarking it though, and comparing throughput with a simple scalar implementation.
Paul R
Yes, but the `_mm_loadu_pd` is the only real bottleneck; the rest is almost nothing compared to it. I have encouraged aligning `c` and keeping `n` even, but if that's impossible, then you're correct, this code may run with minimal benefit. It's still faster than the original, which uses too many `addpd`s and `mulpd`s.
LiraNuna
@LiraNuna If you get round to benchmarking this code I'd be interested to see the results. If you post your benchmark code I'll build and run it with ICC on Core 2 and Core i7.
Paul R
+3  A: 

I would start by unrolling the loop. Something like

double myNum1 = 0, myNum2=0;
for(int i=0;i<n;i+=2)
{
    myNum1 += a[b[ i ]] * c[ i ];
    myNum2 += a[b[i+1]] * c[i+1];
}
// ...extra code to handle the remainder when n isn't a multiple of 2...
double myNum = myNum1 + myNum2;

Hopefully that allows the compiler to interleave the loads with the arithmetic; profile and look at the assembly to see if there's an improvement. Ideally the compiler will generate SSE instructions, but I'm not sure whether that happens in practice.
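One way to inspect the generated code is to compile straight to assembly; the flags below are just an example (dot.c is a placeholder file name), echoing the options used in the benchmarking answer further down:

gcc -O3 -msse2 -ftree-vectorize -S -fverbose-asm dot.c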

Unrolling further might let you do this:

__m128d sum0 = _mm_setzero_pd(), sum1 = _mm_setzero_pd();
for(int i=0;i<n;i+=4)
{
    double temp0 = a[b[ i ]] * c[ i ];
    double temp1 = a[b[i+1]] * c[i+1];
    double temp2 = a[b[i+2]] * c[i+2];
    double temp3 = a[b[i+3]] * c[i+3];
    // _mm_set_pd takes (high, low); the order doesn't matter for a sum
    __m128d pair0 = _mm_set_pd(temp0, temp1);
    __m128d pair1 = _mm_set_pd(temp2, temp3);
    sum0 = _mm_add_pd(sum0, pair0);
    sum1 = _mm_add_pd(sum1, pair1);
}
// ...extra code to handle the remainder when n isn't a multiple of 4...
__m128d total = _mm_add_pd(sum0, sum1);
double myNum = _mm_cvtsd_f64(_mm_add_sd(total, _mm_unpackhi_pd(total, total)));

(Apologies for leaving the remainder handling as a comment; the important part is the loop.) I don't know for sure whether this will be faster; it depends on the various latencies and how well the compiler can rearrange everything. Make sure you profile before and after to see whether there's an actual improvement.

Hope that helps.

celion
Also, it might not be useful at the moment, but I believe Intel's upcoming architecture, Larrabee, will have gather/scatter instructions to deal with this sort of case. See http://www.drdobbs.com/architect/216402188?pgno=4 for some information on it.
celion
+4  A: 

Here's my go at it, fully optimized and tested:

#include <emmintrin.h>

__m128d sum = _mm_setzero_pd();
for(int i=0; i+1<n; i+=2) {   // i+1 < n: don't read past the end when n is odd
    sum = _mm_add_pd(sum, _mm_mul_pd(
        _mm_loadu_pd(c + i),
        _mm_setr_pd(a[b[i]], a[b[i+1]])
    ));
}

if(n & 1) {
    sum = _mm_add_pd(sum, _mm_set_sd(a[b[n-1]] * c[n-1]));
}

double finalSum = _mm_cvtsd_f64(_mm_add_pd(
    sum, _mm_shuffle_pd(sum, sum, _MM_SHUFFLE2(0, 1))
));

This produces very beautiful assembly code using gcc -O2 -msse2 (4.4.1).

As you can tell, having an even n and an aligned c will make this loop go faster. If you can align c, change _mm_loadu_pd to _mm_load_pd for even faster execution times.
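A minimal sketch of one way to get a 16-byte-aligned c (using posix_memalign here; _mm_malloc/_mm_free would also work), so that the _mm_load_pd variant is safe:

#include <stdlib.h>

double *c = NULL;
if (posix_memalign((void **)&c, 16, n * sizeof *c) != 0) {
    /* allocation failed; fall back to the _mm_loadu_pd version */
}
/* ... fill c, then run the loop above with _mm_load_pd(c + i) ... */
free(c);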

LiraNuna
Nice one, I forgot about loading c directly.
celion
Oh, hey, wow - I didn't mean to snatch the selected answer off @celion... It was my personal fun...
LiraNuna
@LiraNuna - I'm sure I'll get over it eventually :) I think a combination of both of our answers would be optimal - my unrolling the loop again, and your loading 'c' via intrinsic.
celion
+1  A: 

Intel processors can issue two floating-point operations per cycle, but only one load, so memory accesses are the tightest constraint. With that in mind, I aimed first to use packed loads to reduce the number of load instructions, and used packed arithmetic just because it was convenient. I've since realized that saturating memory bandwidth may be the biggest issue, and all the messing around with SSE instructions might have been premature optimization if the point was to make the code go fast rather than to learn to vectorize.

SSE

The fewest possible loads, with no assumptions about the indices in b, requires unrolling the loop four times. One 128-bit load gets four indices from b, two 128-bit loads each get a pair of adjacent doubles from c, and gathering from a requires four independent 64-bit loads. That's a floor of 7 cycles per four elements for serial code (enough to saturate my memory bandwidth if access to a doesn't cache nicely). I've left out some annoying things like handling a number of iterations which is not a multiple of 4.

entry: ; (rdi,rsi,rdx,rcx) are (n,a,b,c)
  xorpd xmm0, xmm0
  xor r8, r8
loop:
  movdqa xmm1, [rdx+4*r8]
  movapd xmm2, [rcx+8*r8]
  movapd xmm3, [rcx+8*r8+16]
  movd   r9d,  xmm1
  movq   r10,  xmm1
  movsd  xmm4, [rsi+8*r9]
  shr    r10,  32
  movhpd xmm4, [rsi+8*r10]
  punpckhqdq xmm1, xmm1
  movd   r9d,  xmm1
  movq   r10,  xmm1
  movsd  xmm5, [rsi+8*r9]
  shr    r10,  32
  movhpd xmm5, [rsi+8*r10]
  add    r8,   4
  cmp    r8,   rdi
  mulpd  xmm2, xmm4
  mulpd  xmm3, xmm5
  addpd  xmm0, xmm2
  addpd  xmm0, xmm3
  jl loop

Getting the indices out is the most complicated part. movdqa loads 128 bits of integer data from a 16-byte-aligned address (Nehalem has latency penalties for mixing the "integer" and "float" SSE instruction domains). punpckhqdq moves the high 64 bits to the low 64 bits, but in integer mode, unlike the more simply named movhlps. 32-bit shifts are done in the general-purpose registers. movhpd loads one double into the upper half of an xmm register without disturbing the lower half; this is used to load the elements of a directly into packed registers.
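For readers more comfortable with intrinsics than assembly, a rough C rendering of the same loop might look like the sketch below (my reconstruction, not the author's code); it assumes n is a multiple of 4 and that b and c are 16-byte aligned, matching the movdqa/movapd loads:

#include <emmintrin.h>

static double dot_gather(int n, const double *a, const int *b, const double *c)
{
    __m128d acc = _mm_setzero_pd();
    for (int i = 0; i < n; i += 4) {
        __m128i idx = _mm_load_si128((const __m128i *)(b + i));  /* movdqa */
        __m128d c01 = _mm_load_pd(c + i);                        /* movapd */
        __m128d c23 = _mm_load_pd(c + i + 2);                    /* movapd */
        /* gather a[b[i]]..a[b[i+3]] with four scalar 64-bit loads */
        __m128d a01 = _mm_loadl_pd(_mm_setzero_pd(), a + _mm_cvtsi128_si32(idx));
        a01 = _mm_loadh_pd(a01, a + _mm_cvtsi128_si32(_mm_srli_si128(idx, 4)));
        __m128d a23 = _mm_loadl_pd(_mm_setzero_pd(), a + _mm_cvtsi128_si32(_mm_srli_si128(idx, 8)));
        a23 = _mm_loadh_pd(a23, a + _mm_cvtsi128_si32(_mm_srli_si128(idx, 12)));
        acc = _mm_add_pd(acc, _mm_mul_pd(c01, a01));
        acc = _mm_add_pd(acc, _mm_mul_pd(c23, a23));
    }
    /* horizontal add of the two accumulator lanes */
    return _mm_cvtsd_f64(_mm_add_sd(acc, _mm_unpackhi_pd(acc, acc)));
}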

This assembly version is distinctly faster than the code above, which is in turn faster than the simple code, on every access pattern except the simple case b[i] = i, where the naive loop is actually fastest. I also tried a few things, like wrapping a function around SUM(A(B(:)) * C(:)) in Fortran, which ended up basically equivalent to the simple loop.

I tested on a Q6600 (65 nm Core 2 at 2.4 GHz) with 4 GB of DDR2-667 memory in four modules. Testing memory bandwidth gives about 5333 MB/s, so it seems I'm only seeing a single channel. I'm compiling with Debian's gcc 4.3.2-1.1 and -O3 -ffast-math -msse2 -ftree-vectorize -std=gnu99.

For testing I'm letting n be one million, initializing the arrays so a[b[i]] and c[i] both equal 1.0/(i+1), with a few different patterns of indices. One allocates a with a million elements and sets b to a random permutation, another allocates a with 10M elements and uses every 10th, and the last allocates a with 10M elements and sets up b[i+1] by adding a random number from 1 to 9 to b[i]. I'm timing how long one call takes with gettimeofday, clearing the caches by calling clflush over the arrays, and measuring 1000 trials of each function. I plotted smoothed runtime distributions using some code from the guts of criterion (in particular, the kernel density estimator in the statistics package).
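A rough reconstruction of that timing setup (the structure and names are mine, assuming a kernel with the signature of the dot_gather sketch above; the author's actual harness isn't shown):

#include <stddef.h>
#include <sys/time.h>
#include <emmintrin.h>   /* _mm_clflush */

/* Flush an array out of the caches, one 64-byte line at a time. */
static void flush_array(const void *p, size_t bytes)
{
    const char *q = (const char *)p;
    for (size_t off = 0; off < bytes; off += 64)
        _mm_clflush(q + off);
}

/* Time a single call of the kernel with cold caches, in seconds. */
static double time_one_call(double (*f)(int, const double *, const int *, const double *),
                            int n, const double *a, size_t a_len,
                            const int *b, const double *c)
{
    flush_array(a, a_len * sizeof *a);
    flush_array(b, (size_t)n * sizeof *b);
    flush_array(c, (size_t)n * sizeof *c);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    volatile double sink = f(n, a, b, c);   /* keep the call from being optimized away */
    gettimeofday(&t1, NULL);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
}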

Bandwidth

Now, for the actually important note about bandwidth: 5333 MB/s at a 2.4 GHz clock is just over two bytes per cycle. My data is long enough that nothing should be cacheable, and dividing the total bytes loaded, (16 + 2*16 + 4*64) = 304 bytes per unrolled iteration if every access to a misses a full cache line, by the runtime of my loop gives almost exactly the ~5333 MB/s bandwidth my system has. It should be pretty easy to saturate that bandwidth without SSE: even assuming a were completely cached, just reading b and c for one element moves 12 bytes of data, and with three loads per element at one load per cycle, the naive loop can start a new iteration every third cycle with pipelining.

Assuming anything less than complete caching of a makes arithmetic and instruction count even less of a bottleneck. I wouldn't be surprised if most of the speedup in my code comes from issuing fewer loads to b and c, so that more resources are free to track and speculate past the cache misses on a.

Wider hardware might make more of a difference. A Nehalem system running three channels of DDR3-1333 would need to move 3 * 10667 / 2.66 ≈ 12 bytes per cycle to saturate memory bandwidth. That would be impossible for a single thread if a fits in cache, but at 64 bytes per cache line, misses on a add up quickly: just one of the four loads from a in my loop missing in cache brings the average required bandwidth up to 16 bytes/cycle.

Brandon