Hi all,

I am using SIMD to speed up modular exponentiation, and I am comparing the timing against non-SIMD code. The exponentiation is implemented with the square-and-multiply algorithm.

Ordinary (non-SIMD) version of the code:

b = 1;
for (i = WPE - 1; i >= 0; --i) {              // loop over exponent words, most significant first
    ew = e[i];
    for (j = 0; j < BPW; ++j) {               // loop over the bits of each word
        b = (b * b) % p;                      // square step
        if (ew & 0x80000000U)  b = (b * a) % p;  // multiply step when the current bit is set
        ew <<= 1;
    }
}

SIMD version:

   B.data[0] = B.data[1] = B.data[2] = B.data[3] = 1U;   // four independent results, all starting at 1
   P.data[0] = P.data[1] = P.data[2] = P.data[3] = p;    // modulus broadcast to all four lanes
   for (i = WPE-1; i >= 0; --i) {
      EW.data[0] = e1[i]; EW.data[1] = e2[i]; EW.data[2] = e3[i]; EW.data[3] = e4[i];  // one word of each exponent
      for (j = 0; j < BPW; ++j) {
         B.v *= B.v; B.v -= (B.v / P.v) * P.v;           // square step: B = B*B mod p (compiler vector arithmetic)
         EWV.v = _mm_srli_epi32(EW.v, 31);               // isolate the top bit of each exponent word
         M.data[0] = (EWV.data[0]) ? a1 : 1U;
         M.data[1] = (EWV.data[1]) ? a2 : 1U;
         M.data[2] = (EWV.data[2]) ? a3 : 1U;
         M.data[3] = (EWV.data[3]) ? a4 : 1U;
         B.v *= M.v; B.v -= (B.v / P.v) * P.v;           // conditional multiply step: B = B*M mod p
         EW.v = _mm_slli_epi32(EW.v, 1);                 // move on to the next exponent bit
      }
   }

The issue is that, although it computes correctly, the SIMD version is taking more time than the non-SIMD version.

Please help me find the reason. Any suggestions on SIMD coding are also welcome.

Thanks & regards, Anup.

+3  A: 

All the operations in the for loops should be SIMD operations, not only two. The time taken to set up the arguments for your two intrinsics makes this less efficient than your original example (which is most likely optimized by the compiler).

VJo
+1: moving between scalar code and SIMD code is expensive - SIMD optimization for any given loop needs to be "all or nothing"
Paul R
Do you mean that I need to replace the assignment, multiplication, and division operations with SIMD counterparts? I am using SSE2. I see that, for the above example, there is no multiplication instruction which computes the product of four 32-bit numbers in one go. The same applies to division as well. What should be done then?
anup
@anup I see you are copying data from the e1, e2, e3, e4 arrays into the EW.data array element by element. That is bad. Then you do some operations on that data, but of the SSE2 intrinsics you are only using the shifts. If SSE2 doesn't have the instructions you need, then you cannot use it directly, or you have to do something smart; for example, something like the sketch below.
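
One possibility (just a sketch, assuming the four exponents could be stored interleaved in a single 16-byte aligned array; the layout and helper name here are my own illustration, not from the original code) is to replace the four element-by-element copies with one aligned load:

#include <emmintrin.h>

// Assumed layout: words[4*i + k] holds word i of exponent k, and the array
// is 16-byte aligned, so one load fills all four lanes of the vector at once.
static inline __m128i load_exponent_word(const unsigned int *words, int i)
{
    return _mm_load_si128((const __m128i *)&words[4 * i]);
}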
VJo
Well, I am new to SIMD, so I don't have much knowledge of it or of how to do the individual operations. Can you please explain why those assignments are bad?
anup
A: 

A SIMD loop for 32-bit int data typically looks something like this:

for (i = 0; i < N; i += 4)
{
    // load input vector(s) with data at array index i..i+3
    __m128i va = _mm_load_si128((const __m128i *)&A[i]);
    __m128i vb = _mm_load_si128((const __m128i *)&B[i]);

    // process vectors using SIMD instructions (i.e. no scalar code)
    __m128i vc = _mm_add_epi32(va, vb);

    // store result vector(s) at array index i..i+3
    _mm_store_si128((__m128i *)&C[i], vc);
}

If you find that you need to move between scalar code and SIMD code within the loop then you probably won't gain anything from SIMD optimisation.

Much of the skill in SIMD programming comes from finding ways to make your algorithm work with the limited set of instructions and data types that a given SIMD architecture provides. You will often need to exploit prior knowledge of your data set to get the best possible performance, e.g. if you know for certain that your 32-bit integer values actually have a range that fits within 16 bits then that would make the multiplication part of your algorithm a lot easier to implement.
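
For illustration, here is a minimal SSE2-only sketch of a four-lane 32-bit low multiply (the helper name is just a placeholder, not a standard intrinsic); it multiplies the even and odd lanes separately with _mm_mul_epu32 and then recombines the low halves of the 64-bit products:

#include <emmintrin.h>

// Multiply four unsigned 32-bit lanes, keeping the low 32 bits of each
// product.  SSE2 itself only provides _mm_mul_epu32, which multiplies
// lanes 0 and 2, so the odd lanes are handled with a 32-bit shift first.
static inline __m128i mullo_epu32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                        // 64-bit products of lanes 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32));       // 64-bit products of lanes 1 and 3
    __m128i even_lo = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));  // low dwords of even products
    __m128i odd_lo  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));  // low dwords of odd products
    return _mm_unpacklo_epi32(even_lo, odd_lo);                // interleave back into lane order 0..3
}

If the operands are known to fit in 16 bits, a cheaper alternative is to build the products from _mm_mullo_epi16 and _mm_mulhi_epu16 instead.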

Paul R