I don't think there is a lot you can do that makes a big difference. Maybe you can speed it up a little with OpenMP or SSE, but modern CPUs are quite fast already.
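For example, the multiply function shown further down could be parallelized roughly like this (a minimal sketch, not a drop-in recommendation; it assumes you compile with OpenMP enabled, e.g. -fopenmp, and whether it pays off depends on the vector size and the thread startup cost):

void multiply(float vec[], float factor, int size)
{
    // Split the iterations across threads; each element is
    // touched by exactly one thread, so no synchronization is needed.
    #pragma omp parallel for
    for (int i = 0; i < size; ++i)
        vec[i] *= factor;
}

For short vectors, the threading overhead can easily eat up the gain, so it's worth measuring before committing to it.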
But in some applications, memory bandwidth/latency is actually the bottleneck, and the gap between compute speed and memory speed keeps growing. We already have three levels of cache and need smart prefetch algorithms to avoid huge delays. So it makes sense to think about memory access patterns as well. For example, if you implement such a multiply and an add as separate functions:
void multiply(float vec[], float factor, int size)
{
    for (int i = 0; i < size; ++i)
        vec[i] *= factor;
}

void add(float vec[], float summand, int size)
{
    for (int i = 0; i < size; ++i)
        vec[i] += summand;
}

and use them like this:

void foo(float vec[], int size)
{
    multiply(vec, 2.f, size);
    add(vec, 9.f, size);
}
you're basically making two passes over the same block of memory. Depending on the vector's size, it may not fit into the L1 cache, in which case the second pass has to pull the data back in from a slower cache level or from main memory, which adds extra time. This is obviously bad, and you should try to keep memory accesses "local". In this case, a single loop
void foo(float vec[], int size)
{
    for (int i = 0; i < size; ++i) {
        vec[i] = vec[i] * 2 + 9;
    }
}
is likely to be faster. As a rule of thumb: try to access memory linearly, and try to access memory "locally", by which I mean: try to reuse data while it is still in the L1 cache. Just an idea.
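To make the "linear access" part concrete, here is a small sketch (the dimensions and function names are made up for illustration). Summing a 2D array row by row walks memory in consecutive addresses; summing it column by column jumps a full row ahead on every access, which wastes most of each cache line that gets loaded:

#include <cstddef>

// Hypothetical dimensions, chosen only for illustration.
constexpr std::size_t ROWS = 1024;
constexpr std::size_t COLS = 1024;

// Row-major traversal: consecutive addresses, cache-friendly.
float sum_row_major(const float m[ROWS][COLS])
{
    float s = 0.f;
    for (std::size_t r = 0; r < ROWS; ++r)
        for (std::size_t c = 0; c < COLS; ++c)
            s += m[r][c];
    return s;
}

// Column-major traversal of the same data: each access jumps
// COLS * sizeof(float) bytes ahead, so nearly every access
// touches a different cache line.
float sum_col_major(const float m[ROWS][COLS])
{
    float s = 0.f;
    for (std::size_t c = 0; c < COLS; ++c)
        for (std::size_t r = 0; r < ROWS; ++r)
            s += m[r][c];
    return s;
}

Both functions compute the same sum, but on large arrays the first one typically runs much faster, for exactly the locality reasons above.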