Hi all,

I am using SIMD to speed up modular exponentiation, and I am comparing the timing against non-SIMD code. The exponentiation is implemented with the square-and-multiply algorithm.

Ordinary (non-SIMD) version of the code:

b = 1;
for (i = WPE - 1; i >= 0; --i) {              // loop over exponent words, most significant first
    ew = e[i];
    for (j = 0; j < BPW; ++j) {               // loop over the bits of each word
        b = (b * b) % p;                      // square step
        if (ew & 0x80000000U)  b = (b * a) % p;  // multiply step when the current bit is set
        ew <<= 1;
    }
}

SIMD version:

   B.data[0] = B.data[1] = B.data[2] = B.data[3] = 1U;   // four independent results, all starting at 1
   P.data[0] = P.data[1] = P.data[2] = P.data[3] = p;    // modulus broadcast to all four lanes
   for (i = WPE-1; i >= 0; --i) {
      EW.data[0] = e1[i]; EW.data[1] = e2[i]; EW.data[2] = e3[i]; EW.data[3] = e4[i];  // one word of each exponent
      for (j = 0; j < BPW; ++j) {
         B.v *= B.v; B.v -= (B.v / P.v) * P.v;           // square step: B = B*B mod p (compiler vector arithmetic)
         EWV.v = _mm_srli_epi32(EW.v, 31);               // isolate the top bit of each exponent word
         M.data[0] = (EWV.data[0]) ? a1 : 1U;
         M.data[1] = (EWV.data[1]) ? a2 : 1U;
         M.data[2] = (EWV.data[2]) ? a3 : 1U;
         M.data[3] = (EWV.data[3]) ? a4 : 1U;
         B.v *= M.v; B.v -= (B.v / P.v) * P.v;           // conditional multiply step: B = B*M mod p
         EW.v = _mm_slli_epi32(EW.v, 1);                 // move on to the next exponent bit
      }
   }

The issue is that, although it computes correctly, the SIMD version is taking more time than the non-SIMD version.

Please help me find the reason. Any suggestions on SIMD coding are also welcome.

Thanks & regards, Anup.

+3  A: 

All the operations in the for loops should be SIMD operations, not only two. The time taken to set up the arguments for your two intrinsics makes this less efficient than your original example (which is most likely optimized by the compiler).

VJo
+1: moving between scalar code and SIMD code is expensive - SIMD optimization for any given loop needs to be "all or nothing"
Paul R
Do you mean that I need to replace the assignment, multiplication, and division operations with SIMD counterparts? I am using SSE2. I see that, for the above example, there is no multiplication instruction which computes the product of four 32-bit numbers in one go. The same applies to division as well. What should be done then?
anup
@anup I see you are copying data from the e1, e2, e3, e4 arrays into the EW.data array element by element. That is bad. Then you do some operations on that data, but of the SSE2 intrinsics you are only using the shifts. If SSE2 doesn't have the instructions you need, then you cannot use it directly, or you have to do something smart; for example, something like the sketch below.
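
One possibility (just a sketch, assuming the four exponents could be stored interleaved in a single 16-byte aligned array; the layout and helper name here are my own illustration, not from the original code) is to replace the four element-by-element copies with one aligned load:

#include <emmintrin.h>

// Assumed layout: words[4*i + k] holds word i of exponent k, and the array
// is 16-byte aligned, so one load fills all four lanes of the vector at once.
static inline __m128i load_exponent_word(const unsigned int *words, int i)
{
    return _mm_load_si128((const __m128i *)&words[4 * i]);
}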
VJo
Well, I am new to SIMD, so I don't have much knowledge of it or of how to do the individual operations. Can you please explain why those assignments are bad?
anup
A: 

A SIMD loop for 32-bit int data typically looks something like this:

for (i = 0; i < N; i += 4)
{
    // load input vector(s) with data at array index i..i+3
    __m128i va = _mm_load_si128((const __m128i *)&A[i]);
    __m128i vb = _mm_load_si128((const __m128i *)&B[i]);

    // process vectors using SIMD instructions (i.e. no scalar code)
    __m128i vc = _mm_add_epi32(va, vb);

    // store result vector(s) at array index i..i+3
    _mm_store_si128((__m128i *)&C[i], vc);
}

If you find that you need to move between scalar code and SIMD code within the loop then you probably won't gain anything from SIMD optimisation.

Much of the skill in SIMD programming comes from finding ways to make your algorithm work with the limited set of instructions and data types that a given SIMD architecture provides. You will often need to exploit prior knowledge of your data set to get the best possible performance, e.g. if you know for certain that your 32-bit integer values actually have a range that fits within 16 bits then that would make the multiplication part of your algorithm a lot easier to implement.
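
For illustration, here is a minimal SSE2-only sketch of a four-lane 32-bit low multiply (the helper name is just a placeholder, not a standard intrinsic); it multiplies the even and odd lanes separately with _mm_mul_epu32 and then recombines the low halves of the 64-bit products:

#include <emmintrin.h>

// Multiply four unsigned 32-bit lanes, keeping the low 32 bits of each
// product.  SSE2 itself only provides _mm_mul_epu32, which multiplies
// lanes 0 and 2, so the odd lanes are handled with a 32-bit shift first.
static inline __m128i mullo_epu32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                        // 64-bit products of lanes 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32));       // 64-bit products of lanes 1 and 3
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));  // low dwords of even products
    __m128i odd_lo  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));  // low dwords of odd products
    return _mm_unpacklo_epi32(even_lo, odd_lo);                // interleave back into lane order 0..3
}

If the operands are known to fit in 16 bits, a cheaper alternative is to build the products from _mm_mullo_epi16 and _mm_mulhi_epu16 instead.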

Paul R