Is there a fast way to multiply the values of a float array in C++, to optimize this function (where `count` is a multiple of 4):

void multiply(float* values, float factor, int count)
{
    for(int i=0; i < count; i++)
    {
        *values *= factor;
        values++;
    }
}

A solution must work on Mac OS X and Windows, on Intel and non-Intel hardware. Think SSE, vectorization, and compiler differences (gcc vs. MSVC).

+2  A: 

If you want your code to be cross-platform, then either you're going to have to write platform-independent code, or you're going to have to write a load of #ifdefs.

Have you tried some manual loop unrolling to see if it makes any difference?

Oli Charlesworth
+2  A: 

Since you know the count is a multiple of 4, you can unroll your loop...

void multiply(float* values, float factor, int count)
{
    count = count >> 2; // count / 4, since count is a multiple of 4
    for(int i=0; i < count; i++)
    {
        *values *= factor;
        *(values+1) *= factor;
        *(values+2) *= factor;
        *(values+3) *= factor;
        values += 4;
    }
}
st0le
This will almost certainly not be faster, since it does the same number of multiplications, with more complex pointer arithmetic than the original. I would be interested to see measurements supporting this being an improvement.
Steve Townsend
GCC does this with `-funroll-loops`.
Alex B
@Steve: This could well make a difference, depending on how good the compiler is already (and how good the CPU's branch predictor is). The ratio of multiplies to conditional branches has risen from 1:1 to 4:1.
Oli Charlesworth
@Alex: GCC doesn't know that `count` is a multiple of 4, so this might be marginally faster. It *might* also be faster to count down to zero (`for (int i = count/4; i != 0; --i)`), and to increment the pointer on each line (`*value++ *= factor`).
Mike Seymour
@Steve, my version clocked `0.731s`, the original clocked `0.944s`, with an array of size `(4*10000000)`... about 20% faster.
st0le
@Mike, `*value++ *= factor` might indeed be faster... I tried it out, and it didn't vary much from my version.
st0le
Interesting, thanks and +1
Steve Townsend
@st0le, how fast is your machine? It would have to be about 200 MHz to take that long. Say you were at 2.5GHz: 0.731s works out to roughly 46 cycles per multiply. A Core2 should be able to do a couple of float muls per cycle.
Andrew Bainbridge
@Mike: GCC applies something like Duff's device internally when unrolling a loop. And chances are it'll also be able to auto-vectorize the function.
sellibitze
A: 

Disclaimer: obviously, this won't work on iPhone, iPad, Android, or their future equivalents, as those are ARM-based and have no SSE.

#include <xmmintrin.h> // SSE intrinsics

void multiply(float* values, float factor, int count)
{
    __m128 factor4 = _mm_set1_ps(factor);
    int i = 0;
    for (; i + 3 < count; i += 4)
    {
        // Unaligned load, four multiplies at once, unaligned store.
        __m128 data = _mm_mul_ps(_mm_loadu_ps(values), factor4);
        _mm_storeu_ps(values, data);
        values += 4;
    }
    for (; i < count; i++) // scalar tail for any remaining elements
    {
        *values *= factor;
        values++;
    }
}
rwong
+2  A: 

Have you thought of OpenMP?

Most modern computers have multi-core CPUs, and nearly every major compiler supports OpenMP.

You gain speed at nearly no cost.

http://de.wikipedia.org/wiki/OpenMP
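
For illustration, here is a minimal sketch of what that might look like for the loop in the question (an assumption, not code from this answer; compile with `-fopenmp` on GCC or `/openmp` on MSVC):

void multiply(float* values, float factor, int count)
{
    // Split the iterations across the available cores; if OpenMP is
    // not enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int i = 0; i < count; i++)
        values[i] *= factor;
}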

Bigbohne
A: 

The best solution is to keep it simple and let the compiler optimize it for you. GCC knows about SSE, SSE2, AltiVec, and more. If your code is too complex, your compiler won't be able to optimize it for every possible target.
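
As a sketch of this approach (the flags shown are illustrative; exact spellings vary by compiler version and target), you keep the scalar loop and turn on optimization:

// Keep the loop simple and let the compiler vectorize it, e.g.
//   g++ -O3 ...                            (-O3 implies -ftree-vectorize on recent GCC)
//   g++ -O2 -ftree-vectorize -msse2 ...    (older GCC on x86)
void multiply(float* values, float factor, int count)
{
    for (int i = 0; i < count; i++)
        values[i] *= factor;
}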

G B
A: 

As you mentioned, there are numerous architectures out there with SIMD extensions, and SIMD is probably your best bet when it comes to optimization. They are all platform-specific, however, and C and C++ as languages are not SIMD-friendly.

The first thing you should try, however, is enabling the SIMD-specific flags for your build. The compiler may then recognize patterns that can be optimized with SIMD.

The next thing is to write platform-specific SIMD code using compiler intrinsics or assembly where appropriate. You should, however, keep a portable non-SIMD implementation for platforms that do not have an optimized version, and use #ifdefs to enable SIMD on the platforms that support it, as sketched below.
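
A sketch of that dispatch pattern (the macro tests and the SSE path here are illustrative assumptions, not a tested implementation):

#if defined(__SSE__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1) || defined(_M_X64)
#include <xmmintrin.h>

// SIMD path: multiply four floats per instruction on x86 with SSE.
void multiply(float* values, float factor, int count)
{
    __m128 factor4 = _mm_set1_ps(factor);
    int i = 0;
    for (; i + 3 < count; i += 4, values += 4)
        _mm_storeu_ps(values, _mm_mul_ps(_mm_loadu_ps(values), factor4));
    for (; i < count; i++)   // scalar tail
        *values++ *= factor;
}

#else

// Portable fallback for platforms without an optimized version.
void multiply(float* values, float factor, int count)
{
    for (int i = 0; i < count; i++)
        values[i] *= factor;
}

#endif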

Lastly, at least on ARM (I'm not sure about Intel), be aware that smaller integer and floating-point types allow a larger number of parallel operations per SIMD instruction.
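
To illustrate that point on x86 (assuming SSE2 is available; the function names here are hypothetical), one 128-bit instruction covers four 32-bit floats but eight 16-bit integers:

#include <emmintrin.h> // SSE2

// Four 32-bit floats per 128-bit multiply...
void scale4(float* v, __m128 factor4)
{
    _mm_storeu_ps(v, _mm_mul_ps(_mm_loadu_ps(v), factor4));
}

// ...but eight 16-bit integers: narrower elements mean more lanes.
void scale8(short* v, __m128i factor8)
{
    __m128i data = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(v), _mm_mullo_epi16(data, factor8));
}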

doron
A: 

I don't think there is a lot you can do that makes a big difference. Maybe you can speed it up a little with OpenMP or SSE, but modern CPUs are quite fast already. In some applications memory bandwidth/latency is actually the bottleneck, and it is getting worse. We already have three levels of cache and need smart prefetch algorithms to avoid huge delays. So it makes sense to think about memory access patterns as well. For example, if you implement such a multiply and an add, and use them like this:

void multiply(float vec[], float factor, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] *= factor;
}

void add(float vec[], float summand, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] += summand;
}

void foo(float vec[], int size)
{
  multiply(vec,2.f,size);
  add(vec,9.f,size);
}

you're basically making two passes over the same block of memory. Depending on the vector's size, it might not fit into the L1 cache, in which case passing over it twice adds some extra time. This is obviously bad, and you should try to keep memory accesses "local". In this case, a single loop

void foo(float vec[], int size)
{
  for (int i=0; i<size; ++i) {
    vec[i] = vec[i]*2+9;
  }
}

is likely to be faster. As a rule of thumb: try to access memory linearly, and try to access memory "locally", by which I mean: reuse data that is already in the L1 cache. Just an idea.

sellibitze