I don't think there is a lot you can do that makes a big difference. Maybe you can speed it up a little with OpenMP or SSE, but modern CPUs are quite fast already.
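For example, the multiply function shown further down could be parallelized roughly like this (a minimal sketch, not a drop-in recommendation; it assumes you compile with OpenMP enabled, e.g. -fopenmp, and whether it pays off depends on the vector size and the thread startup cost):

void multiply(float vec[], float factor, int size)
{
    // Split the iterations across threads; each element is
    // touched by exactly one thread, so no synchronization is needed.
    #pragma omp parallel for
    for (int i = 0; i < size; ++i)
        vec[i] *= factor;
}

For short vectors, the threading overhead can easily eat up the gain, so it's worth measuring before committing to it.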
But in some applications, memory bandwidth/latency is actually the bottleneck, and the gap between compute speed and memory speed keeps growing. We already have three levels of cache and need smart prefetch algorithms to avoid huge delays. So it makes sense to think about memory access patterns as well. For example, if you implement such a multiply and an add as separate functions:
void multiply(float vec[], float factor, int size)
{
    for (int i = 0; i < size; ++i)
        vec[i] *= factor;
}

void add(float vec[], float summand, int size)
{
    for (int i = 0; i < size; ++i)
        vec[i] += summand;
}

and use them like this:

void foo(float vec[], int size)
{
    multiply(vec, 2.f, size);
    add(vec, 9.f, size);
}
you're basically making two passes over the same block of memory. Depending on the vector's size, it may not fit into the L1 cache, in which case the second pass has to pull the data back in from a slower cache level or from main memory, which adds extra time. This is obviously bad, and you should try to keep memory accesses "local". In this case, a single loop
void foo(float vec[], int size)
{
    for (int i = 0; i < size; ++i) {
        vec[i] = vec[i] * 2 + 9;
    }
}
is likely to be faster. As a rule of thumb: try to access memory linearly, and try to access memory "locally", by which I mean: try to reuse data while it is still in the L1 cache. Just an idea.
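To make the "linear access" part concrete, here is a small sketch (the dimensions and function names are made up for illustration). Summing a 2D array row by row walks memory in consecutive addresses; summing it column by column jumps a full row ahead on every access, which wastes most of each cache line that gets loaded:

#include <cstddef>

// Hypothetical dimensions, chosen only for illustration.
constexpr std::size_t ROWS = 1024;
constexpr std::size_t COLS = 1024;

// Row-major traversal: consecutive addresses, cache-friendly.
float sum_row_major(const float m[ROWS][COLS])
{
    float s = 0.f;
    for (std::size_t r = 0; r < ROWS; ++r)
        for (std::size_t c = 0; c < COLS; ++c)
            s += m[r][c];
    return s;
}

// Column-major traversal of the same data: each access jumps
// COLS * sizeof(float) bytes ahead, so nearly every access
// touches a different cache line.
float sum_col_major(const float m[ROWS][COLS])
{
    float s = 0.f;
    for (std::size_t c = 0; c < COLS; ++c)
        for (std::size_t r = 0; r < ROWS; ++r)
            s += m[r][c];
    return s;
}

Both functions compute the same sum, but on large arrays the first one typically runs much faster, for exactly the locality reasons above.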