views:

261

answers:

4

I Want to optimize the following function using SIMD (SSE2 & such):

int64_t fun(int64_t N, int size, int* p)
{
    int64_t sum = 0;
    for(int i=1; i<size; i++)
       sum += (N/i)*p[i];
    return sum;
}

This seems like an eminently vectorizable task, except that the needed instructions just aren't there ...

We can assume that N is very large (10^12 to 10^18) and size~sqrt(N). We can also assume that p can only take values of -1, 0, and 1; so we don't need a real multiplication, the (N/i)*p[i] can be done with four instructions (pcmpgt, pxor, psub, pand), if we could just somehow compute N/i.

+1  A: 

I suggest you do this with floating point SIMD operations - either single or double precision depending on your accuracy requirements. Conversion from int to float or double is relatively fast using SSE.

Paul R
The use of floating point is problematic, because floating point SIMD operations have a 52-bit resolution and my N is potentially larger than that...
@user434507: it depends on what your accuracy requirements are - you haven't actually stated how much precision you need. Given that N/i is a truncating integer divide then it doesn't appear that you are overly concerned about absolute numerical precision in the result ?
Paul R
I need the exact answer.
@user434507: in that case I would suggest calculating the first M terms using scalar code, until N/i becomes manageable using SIMD floating point. You could then process blocks of points and periodically update a scalar 64 bit integer sum after each block.
Paul R