I have two large vectors and I am trying to do a kind of element-wise multiplication in which each even-indexed element of the first vector is multiplied by the following odd-indexed element of the second vector, and each odd-indexed element of the first vector is multiplied by the preceding even-indexed element of the second vector. Here "even" and "odd" refer to zero-based indices, as in the code further down.
For example, written in one-based notation (so V1(1) is the element at index 0):
vector 1 is V1(1) V1(2) V1(3) V1(4)
vector 2 is V2(1) V2(2) V2(3) V2(4)
and the products I need are:
V1(1) * V2(2)
V1(3) * V2(4)
V1(2) * V2(1)
V1(4) * V2(3)
I have written CUDA code to do this (Pds holds the elements of the first vector in shared memory, Nds those of the second vector):
// Instead of % 2, testing the lowest bit to decide whether
// an index is odd or even is faster.
if ((tx & 0x0001) == 0x0000)
    Nds[tx+1] = Pds[tx] * Nds[tx+1];  // even tx: update the next (odd) element
else
    Nds[tx-1] = Pds[tx] * Nds[tx-1];  // odd tx: update the preceding (even) element
__syncthreads();
Is there any way to further accelerate this code or avoid the divergence?
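For comparison, here is a minimal sketch of a branch-free form of the same update. Even and odd threads differ only in the partner index, and flipping the lowest bit with `tx ^ 1` gives `tx + 1` for even `tx` and `tx - 1` for odd `tx`, so the if/else can collapse into a single statement. The kernel name `pairSwapMultiply`, the host wrapper `runPairSwap`, the tile size `TILE`, the `float` element type, and the global-memory load and store around the update are all assumptions made for illustration; they are not part of the code above. The sketch also assumes the vector length is even. Whether this is measurably faster is something to profile, since the compiler may already turn such a short branch into predicated instructions.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical tile size (threads per block). It must be even so that
// every element's partner (tx ^ 1) lives in the same block.
constexpr int TILE = 256;

// Hypothetical kernel. Computes out[i] = v1[i ^ 1] * v2[i], which is what
// Nds holds after the pairwise update shown in the question.
__global__ void pairSwapMultiply(const float *v1, const float *v2,
                                 float *out, int n)
{
    __shared__ float Pds[TILE];
    __shared__ float Nds[TILE];

    const int tx = threadIdx.x;
    const int gi = blockIdx.x * TILE + tx;

    // Stage both vectors in shared memory; pad the tail with zeros.
    Pds[tx] = (gi < n) ? v1[gi] : 0.0f;
    Nds[tx] = (gi < n) ? v2[gi] : 0.0f;
    __syncthreads();

    // tx ^ 1 flips the lowest bit: even tx -> tx + 1, odd tx -> tx - 1.
    // Every thread updates a distinct Nds slot, so writes do not collide.
    Nds[tx ^ 1] *= Pds[tx];
    __syncthreads();

    if (gi < n)
        out[gi] = Nds[tx];
}

// Hypothetical host wrapper, assuming both vectors have the same even
// length. CUDA error checking is omitted to keep the sketch short.
std::vector<float> runPairSwap(const std::vector<float> &v1,
                               const std::vector<float> &v2)
{
    const int n = static_cast<int>(v1.size());
    const size_t bytes = n * sizeof(float);

    float *d1 = nullptr, *d2 = nullptr, *dOut = nullptr;
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(d1, v1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, v2.data(), bytes, cudaMemcpyHostToDevice);

    const int blocks = (n + TILE - 1) / TILE;
    pairSwapMultiply<<<blocks, TILE>>>(d1, d2, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dOut);
    return out;
}
```

Calling `runPairSwap` with V1 = {1, 2, 3, 4} and V2 = {5, 6, 7, 8} should leave the products V1(2)*V2(1), V1(1)*V2(2), V1(4)*V2(3), V1(3)*V2(4) in positions 1 to 4 (one-based), matching the pairing in the example. Because the accesses within a warp are just a permutation of neighbouring slots, this form should not introduce shared-memory bank conflicts for 4-byte elements.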