I am a newbie on optimization. I have been reading some references on how to optimize C++ code, but I have a hard time applying them to real code. Therefore I just want to gather some real-world optimization techniques on how to squeeze as much juice from the CPU/memory as possible out of the loop below

double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));
for(int t = 0; t < T; ++t){
    sum += fun(a,b,c,d,e,f,sum);
    *(array+t) = sum;
}

where a, b, c, d, e, f are doubles and T is an int. Anything, including but not limited to memory alignment, parallelism, OpenMP/MPI, and SSE instructions, is welcome. The compiler is standard gcc, Microsoft, or another commonly available compiler. If the solution is compiler specific, please specify the compiler and any option flags associated with your solution.

Thanks!

PS: Forgot to mention the properties of fun. Please assume that it is a simple function with no loop inside, composed only of basic arithmetic operations. Simply think of it as an inline function.

EDIT2: since the details of fun are important, please forget the parameters c, d, e, f and assume fun is defined as

inline double fun(double a, double b, double sum){
    return sum + a * (b - sum);
}
+3  A: 

Unless fun() is very trivial - in which case consider inlining it - it is likely to dominate anything else you can do to the loop.

You probably want to look at the algorithm inside fun()

+6  A: 

Since sum depends on its previous values in a non-trivial way, it is impossible to parallelize the code (so OpenMP and MPI are out).

Memory alignment and SSE should be enforced/used automatically with appropriate compiler settings.

Besides inlining fun and unrolling the loop (by compiling with -O3) there is not much we can do if fun is completely general.


Since fun(a,b,sum) = sum + a*(b-sum), we have the closed form

array[t] = a*b/(a-1) * ( 1 - (2-a)^(t+1) )

which can be vectorized and parallelized, but the division and exponentiation can be very expensive.

Nevertheless, with the closed form we can start the loop from any index, e.g. create 2 threads, one running from t = 0 to T/2-1 and the other from t = T/2 to T-1, each performing the original loop but with its initial sum computed using the closed-form solution above. Also, if only a few values from the array are needed, they can be computed lazily.

And for SSE, you could first fill the array with (2-a)^(t+1), then apply x :-> x - 1 to the whole array, and then apply x :-> x * c to the whole array, where c = a*b/(1-a) - but there may be automatic vectorization already.
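
A minimal sketch of the chunked idea, assuming OpenMP (compile with e.g. gcc -fopenmp; the chunk seeds can differ from the sequential result in the last few bits because of floating-point rounding):

#include <math.h>
#include <omp.h>

void fill(double *array, int T, double a, double b)
{
    const double s = a*b/(a-1);  /* fixed point of the recurrence; assumes a != 1 */
    const double q = 2 - a;
    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int chunk = (T + nthreads - 1) / nthreads;
        int begin = omp_get_thread_num() * chunk;
        int end   = begin + chunk < T ? begin + chunk : T;
        /* seed from the closed form: the sum before step t is s*(1 - q^t) */
        double sum = s * (1 - pow(q, begin));
        for (int t = begin; t < end; ++t) {
            sum = q*sum + a*b;   /* equivalent to sum += fun(a,b,sum) */
            array[t] = sum;
        }
    }
}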

KennyTM
How can you algebraically manipulate fun(a,b,sum) when he didn't tell us what it does?
phkahler
@phk: See OP's update!
KennyTM
It'd be good to put some numbers on "very expensive"--generally, I find division about 5x as expensive as multiplication, and exponentiation about 50x as expensive.
Rex Kerr
Note that `a-1` is a constant, and a double. Hence `1/(a-1)` can be precomputed. Exponentiation is to a power of t, which is constant and thus takes O(log t). Still makes the whole calculation O(T log T) though.
MSalters
+1  A: 

Have a look at callgrind, part of the valgrind toolset. Run your code through that and see if anything sticks out as taking an unusually large amount of time. Then you'll know what needs optimizing. Otherwise, you're just guessing, and you (as well as the rest of us) will likely guess wrong.

Kristo
+1  A: 

I would enable vector processing on the compiler. You could rewrite the code to unroll the loops, but the compiler will do that for you if it is a recent enough version.

You could use t + array as the for-loop variable; again, the optimizer might do this. It means your array index won't use a multiply - again, the optimizer might do this.

You could use the compiler switch that dumps the generated assembly (e.g. -S for gcc) and use that to see what you can change in the code to make it run faster.

Tony Lambert
but `fun` is a function of `sum` such that it cannot be moved out of the loop.
leon
I can't help but feel that this is where it can be optimized if only we knew what it did.
Tony Lambert
+3  A: 

The code you have above is about as fast as you can make it.

  • Memory alignment is normally handled well enough by malloc anyway.
  • The code cannot be parallelized because f is a function of previous sums (so you can't break up the computation into chunks).
  • The computations are not specified so it's unclear whether SSE or CUDA or anything like that would be applicable.
  • Likewise, you can't perform any useful loop-unrolling based on the properties of f because we don't know what it does.

(Stylistically, I'd use array[t] since it's clearer what's going on and it is no slower.)


Edit: now that we have f(a,b,sum) = sum + a*(b-sum), we can try loop unrolling by hand to see if there's some pattern. Like so (where I'm using ** to mean "to the power of"):

sum(n) = sum(n-1) + sum(n-1) + a*(b-sum(n-1)) = (2-a)*sum(n-1) + a*b
sum(n) = (2-a)*( (2-a)*sum(n-2) + a*b ) + a*b
. . .
sum(n) = a*b*(2-a)**n + a*b*(2-a)**(n-1) + ... + a*b
sum(n) = a*b*( (2-a)**0 + (2-a)**1 + ... + (2-a)**n )

Well, now, isn't that interesting! We've converted from a recurrent formula to a geometric series! And, you may recall that the geometric series

SUM( x**n , n = 0..N ) = (x**(N+1) - 1) / (x - 1)

so that

sum(n) = a*b*( (pow(2-a,n+1) - 1) / (1-a) )

Now that you've done that math, you can start in on the sum anywhere you want (using the somewhat expensive pow computation). If you have M free processors, and your array is long, you can split it into M equal pieces, use the above computation to find the first sum, and then use the recurrence formula you were using before (with the function) to fill the rest in.

At the very least, you could calculate a*b and 2-a separately and use those instead of the existing function:

sum = ab + twonega*sum

That cuts the math in your inner loop in half, approximately.
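
As a concrete sketch of that last suggestion (only hoisting the two constants named above; the numerics are unchanged):

const double ab = a*b;
const double twonega = 2 - a;
for (int t = 0; t < T; ++t) {
    sum = ab + twonega*sum;   /* replaces sum += fun(a,b,sum) */
    array[t] = sum;
}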

Rex Kerr
+3  A: 

one (very) minor optimization that can be done is:

double sum = 0, *array;   
array = (double*) malloc(T * sizeof(double));
double* pStart = array;
double* pEnd = array + T;
while(pStart < pEnd)
{
    sum += fun(a,b,c,d,e,f,sum); 
    *pStart++ = sum; 
}

This eliminates the addition of t to array on every iteration of the loop; the incrementing of t is replaced by the incrementing of pStart. For small iteration counts (think fewer than 3 - in that case the loop should be unrolled) there is no real gain. The compiler should do this automatically, but sometimes it needs a little encouragement.

Also, depending on the size range of T, it might be possible to gain performance by using a variable-sized array (which would be stack allocated) or an aligned _alloca (see the sketch below).
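
For example, a sketch of the stack-allocated variant (an assumption on my part: _alloca is MSVC-specific and declared in <malloc.h>, and T must be known to be small enough to fit on the stack):

#include <malloc.h>   /* _alloca on MSVC */

double *array = (double*) _alloca(T * sizeof(double));
/* ... same loop as before; no free() needed - the memory is released on return */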

Necrolis
Many processors do not actually add the t before the store; they use register-indirect addressing, which performs the add while executing the instruction, all within the clock cycle: str r1,[r2,r3]. So it can be a wash. On some processors the destructively updated moving pointer is faster.
dwelch
Of course you could do a str r1,[r2],#8 (using ARM as an example) and do the destructive pointer update with its increment in one instruction, where the register-indirect form takes two. In general the pointer solution and the indexed-array solution are a wash performance-wise unless you target a specific processor; the indexed-array solution is easier to read and maintain for most folks. All we are doing is saving one, maybe two, instructions here - fun() plus the setup for the call will dominate the time consumed.
dwelch
+1  A: 

Just a couple of suggestions that haven't come up yet. I'm a bit out of date when it comes to modern PC-style processors, so they may make no significant difference.

  • Using float might be faster than double if you can tolerate the lower precision. Integer-based fixed point might be faster still, depending on how well floating point operations are pipelined.
  • Counting down from T to zero (and incrementing array each iteration) might be slightly faster - certainly, on an ARM processor this would save one cycle per loop. See the sketch below.
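
A sketch of the counting-down variant (assuming the simplified fun from the question's edit; the array is still filled front to back):

double *p = array;
for (int t = T; t != 0; --t) {   /* the loop test compares against zero */
    sum += fun(a, b, sum);
    *p++ = sum;
}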
Mike Seymour
float and double ops are usually equally fast in today's FP ALUs... the savings usually come from higher cache hit rates :-)
fortran
I am in a field where float is not acceptable :(
leon
+1  A: 

another very minor optimization would be to turn the for() into

while (T--)

as comparing with zero is usually faster than comparing two random integers.

ggiroux
+2  A: 

Accept @KennyTM's answer. He is wrong to state that the computation is not parallelisable, as he goes on to show. In showing that you can rewrite your recurrence relation in closed form he illustrates a very general principle of optimising programs -- choose the best algorithm you can find and implement that. None of the micro-optimisations that the other answers suggest will come close to computing the closed form and spreading the computation across many processors in parallel.

And, lest anyone offer the suggestion that this is just an example for learning parallelisation, I contend that @KennyTM's answer still holds good -- don't learn to optimise fragments of code, learn to optimise computations. Choose the best algorithm for your purposes, implement it well and only then worry about performance.

High Performance Mark
+1 but I think that "Choose the best algorithm for your purposes, implement it well and only then worry about performance." is misleading. After all, the whole reason why we select the best algorithm is because we are worried about performance.
John Dibling
Given the costs of exponentiation, you'd need 100+ processors to exceed the throughput of one if you split it up in the naive way. You don't want the closed form--you want to _initialize_ with the closed form but otherwise use the recurrence which updates >50x faster.
Rex Kerr
@John -- good point, replace 'only then worry about performance' with the words 'only then worry about the possibilities of micro-modifications of code for performance' or some other variant of those words which suit you best.
High Performance Mark
@Rex -- maybe so, time to get out the stopwatch and start measuring. Remember, some of us do have 100+ processors ! Sadly, I often go for a brute force solution at the expense of elegance and efficiency, 'cos it's more efficient in terms of my time.
High Performance Mark
@Rex - Indeed, I just think that many people hide behind the mantra of "premature optimization is the root of all evil" to chide *all* attempts at optimization, not just micro-optimization. @Mark - It is easy to utilize many hundreds of cores by simply porting to CUDA.
John Dibling
@HPM: Yes; I hinted at this in my answer to the question, and I also have access to 100+ processors. But I might want to use my 100+ processors for something that requires 100+ processors instead of something that doesn't. @John: Indeed, and people also hide behind the "go away and use a profiler" answer too often also. Good advice, yes, but it's not the _only_ good advice.
Rex Kerr
@HPM The intent of this post is for me to learn low-level optimization. The recurrence relation in my problem is often more complex than this, and I, having a mathematical background, always try to find a better formula for it. This question was made generic so that people can discuss low-level optimization techniques that one might be able to use. I am not asking for a solution to this specific problem but for a set of techniques that I can use to make my code faster after profiling. Thanks :)
leon
A: 

Following the excellent answer by @KennyTM, I'd say the fastest way to do it sequentially should be:

double acc = 1, *array;
array = (double*) malloc(T * sizeof(double));
// the compiler should be able to derive those constant expressions, but let's do it explicitly anyway
const double k = a*b/(a-1);
const double twominusa = 2 - a;
for(int t = 0; t < T; ++t){
    acc *= twominusa;
    array[t] = k*(1-acc);
}
fortran
A: 

Rather than having the compiler unroll the loop, you could unroll the loop yourself and add some data prefetching. Search the web for data-driven design C++. Here is an example of loop unrolling and prefetching data:

double sum = 0, *array;
array = (double*) malloc(T * sizeof(double));

// Calculate the number of unrolled iterations
//    and the remaining iterations.
unsigned int iterations = T / 4;
unsigned int remaining_iterations = T % 4;
double sum1;
double sum2;
double sum3;
double sum4;
double * p_array = array;
for (unsigned int i = 0; i < iterations; ++i)
{
    // Do some data precalculation
    sum += fun(a,b,c,d,e,f,sum);
    sum1 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum2 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum3 = sum;
    sum += fun(a,b,c,d,e,f,sum);
    sum4 = sum;
    // Do a "block" transfer to the array.
    p_array[0] = sum1;
    p_array[1] = sum2;
    p_array[2] = sum3;
    p_array[3] = sum4;
    p_array += 4;
}
// Handle the few remaining calculations
for (unsigned int t = 0; t < remaining_iterations; ++t)
{
    sum += fun(a,b,c,d,e,f,sum);
    p_array[t] = sum;
}

The big hit here is the call to the fun function. There are hidden setup and restore instructions involved when executing a function. Also, the call forces a branch, which will cause instruction pipelines to be flushed and reloaded (or cause the processor to waste time in branch prediction).

Another time performance hit is the number of variables passed to the function. These variables have to be placed on the stack and copied into the function, which takes time.

Thomas Matthews
I thought that the loop unrolling was done by the compilers today :-) plus, your manual unrolling still has data dependencies that could be alleviated with a more filled pipeline if you'd left the store operations interleaved.
fortran
The function can be inlined and thus will have no negative impact on performance.
iconiK
A: 

Many computers have a dedicated multiply-accumulate unit implemented in the processor's hardware. Depending on your ultimate algorithm and target platform, you may be able to use this if the compiler isn't already using it when it optimizes.
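
For illustration, a sketch using the C99 fma() function (an assumption: the compiler/libm supports it and the target has a hardware FMA unit; the rewritten update (2-a)*sum + a*b from the answer above has exactly the multiply-accumulate shape):

#include <math.h>

/* fma(x, y, z) computes x*y + z, ideally as one hardware instruction */
for (int t = 0; t < T; ++t) {
    sum = fma(2 - a, sum, a*b);
    array[t] = sum;
}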

John Dibling
A: 

The following may not be worth it, but ....

The routine fun() takes seven (7) parameters.

Changing the order of the parameters to fun(sum, a, b, c, d, e, f) could help, IF the compiler can take advantage of the following scenario. Parameters a through f appear to be invariant, and only sum appears to be changing at this level in the code. As parameters are pushed onto the stack in C/C++ from right to left, if parameters a through f truly are invariant, then the compiler could in theory optimize the pushing of the stack variables. In other words, a through f would only need to be pushed onto the stack once, and sum could in theory be the only parameter pushed and popped while in the loop.

I do not know if the compiler would take advantage of such a scenario, but I am tossing it out there as a possibility. Disassembling could verify it as true or false, and profiling would indicate how much of a benefit that may be if true.

Sparky