Hi all,

Loop unwinding is a common way to help the compiler to optimize performance. I was wondering if and to what extent the performance gain is affected by what is in the body of the loop:

  1. number of statements
  2. number of function calls
  3. use of complex data types, virtual methods, etc.
  4. dynamic (de)allocation of memory

What rules (of thumb?) do you use to decide whether or not to unwind a performance-critical loop? What other optimisations do you consider in these cases?

+12  A: 

This is an optimisation question, and as such there is only one rule of thumb: test the performance, and try a loop unwinding optimisation only if your testing demonstrates that you need to. Consider less disruptive optimisations first.
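
As a minimal illustration of that rule, a crude timing harness is usually enough to find out whether unrolling buys anything. This is only a sketch; the kernel names in the usage comment are hypothetical:

#include <stdio.h>
#include <time.h>

// Time a kernel over a number of repetitions and print the result.
void benchmark(const char *name, void (*kernel)(void), int reps)
{
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        kernel();
    printf("%s: %.3f s\n", name, (double)(clock() - t0) / CLOCKS_PER_SEC);
}

// Usage (hypothetical kernels):
//   benchmark("plain",    process_plain,    1000);
//   benchmark("unrolled", process_unrolled, 1000);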

Jon Topper
+23  A: 

In general, unrolling loops by hand is not worth the effort. The compiler knows the target architecture better than you do and will unroll the loop if it is beneficial.

There are code paths that benefit from unrolling on Pentium-M type CPUs but not on Core 2, for example. If I unroll by hand, the compiler can't make that decision anymore and I may end up with less than optimal code, i.e. exactly the opposite of what I tried to achieve.

There are several cases where I do unroll performance-critical loops by hand, but I only do this if I know that the compiler will - after manual unrolling - be able to use architecture-specific features such as SSE or MMX instructions. Then, and only then, do I do it.
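
For illustration, a minimal sketch of such a loop, summing an array of floats four elements per iteration with SSE intrinsics. The function name and the assumption that the length is a multiple of four are made up for the example:

#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical example: sum n floats, four per iteration.
// Assumes n is a multiple of 4; a real version needs a scalar tail loop.
float sum4(const float *p, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));  // four additions at once

    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}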

Btw - modern CPUs are very efficient at executing well-predictable branches, and that is exactly what a loop is. The loop overhead is so small these days that it rarely makes a difference. However, memory latency effects that may occur due to the increase in code size will make a difference.

Nils Pipenbrinck
s/archive/achieve/
KeyserSoze
+4  A: 

In my experience, loop unwinding is worth the work it takes when:

  • There are only a few statements within the loop.
  • The statements involve only a small number of different variables and no function calls.
  • Your operations work on already allocated memory (an in-place image transformation, for example).

Partial unwinding is often less work for 80% of the gain. So instead of looping over all pixels of an N by M image (N*M iterations) where N is always divisible by 8, loop (N*M/8) times over each block of eight pixels. This is especially efficient if you are performing some operation that uses some of the neighbouring pixels.
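
As a minimal sketch of the block-of-eight idea (the function and the trivial per-pixel operation are made up for the example, and the buffer size is assumed to be divisible by 8):

// Add a constant to every pixel of an n*m 8-bit image, eight pixels per
// iteration. Assumes n*m is divisible by 8.
void brighten(unsigned char *img, int n, int m, int delta)
{
    for (int i = 0; i < n * m; i += 8) {
        img[i + 0] += delta;
        img[i + 1] += delta;
        img[i + 2] += delta;
        img[i + 3] += delta;
        img[i + 4] += delta;
        img[i + 5] += delta;
        img[i + 6] += delta;
        img[i + 7] += delta;
    }
}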

I've had very good results hand-optimising pixel-wise operations into MMX or SSE instructions (8 or 16 pixels at a time) but I've also spent days hand optimising something only to find out that the version optimised by the compiler ran ten times faster.

And by the way, for the most (beautiful|remarkable) example of loop unwinding, check out Duff's device.
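
For reference, a sketch of Duff's device from memory (the classic form copies count bytes, eight per pass of the do/while, and assumes count > 0):

// Duff's device: a switch interleaved with a do/while unrolls the copy by
// eight while handling the remainder in the same construct. Assumes count > 0.
void duff_copy(char *to, const char *from, int count)
{
    int n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}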

jilles de wit
I would use the word remarkable for Duff's device ;-).
Gamecat
that would also be a suitable word yes :-)
jilles de wit
While clever, I think Duff's device is bad code construction. There is no real speed advantage of his structure relative to a straightforward unrolled loop. Two consecutive loops, one unrolled and the other not (to handle the round-off), are WAY more self-evident and don't depend on odd syntax possibilities of C.
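
A minimal sketch of that two-loop structure (the kernel is made up for the example):

// Main loop unrolled by four, followed by a plain loop for the round-off.
void scale(float *v, int n, float f)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {   // unrolled main loop
        v[i + 0] *= f;
        v[i + 1] *= f;
        v[i + 2] *= f;
        v[i + 3] *= f;
    }
    for (; i < n; i++)             // remainder loop
        v[i] *= f;
}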
Tall Jeff
+2  A: 

An important thing to consider: In production code at your workplace, future readability of your code far outweighs the benefits of loop unwinding. Hardware is cheap, programmer time is not. I would only worry about loop unwinding if that is the ONLY way to solve a proven performance problem (say in a low-powered device).

Other thoughts: The characteristics of compilers vary greatly, and in some cases, such as Java, the determination is made on the fly by the HotSpot JVM, so there I would argue against loop unwinding in any case.

Galghamon
+1  A: 

Manually unwinding loops might be inefficient on newer processors, but it can still be useful on GPUs and light architectures such as ARM, which are not as good as current-generation desktop CPUs at branch prediction, and because tests and jumps actually waste cycles on those processors.

That said, it should only be done on very tight loops and in blocks, because by unrolling you are significantly bloating the code size; this will blow the cache on small devices and you will end up with a much worse problem on your hands.

A note of warning though: unrolling a loop should be the very last resort when optimizing. It perverts your code at a level that makes it unmaintainable, and someone reading it might snap and threaten you and your family later on. Knowing that, make it worth it :)

Using macros can greatly help in making the code more readable, and it makes the unroll deliberate.

Example:

for(int i=0; i<256; i++)
{
    a += ptr[i] << 8;
    a -= ptr[i - k] << 8;
    // And possibly some more
}

Can unroll to:

#define UNROLL(i) \
    a+=(ptr[i]) << 8; \
    a-=(ptr[i-k]) << 8;


for(int i=0; i<256; i+=8)
{
    UNROLL(i);
    UNROLL(i+1);
    UNROLL(i+2);
    UNROLL(i+3);
    UNROLL(i+4);
    UNROLL(i+5);
    UNROLL(i+6);
    UNROLL(i+7);
}

On a somewhat related note, if you really want to win on the instruction-count side, make sure constants get unified into as few immediates as possible in your code, so that you don't end up with the following assembly:

// Bad
MOV r1, 4
//  ...
ADD r2, r2, 1
//  ...
ADD r2, r2, 4

Instead of:

// Better
ADD r2, r2, 5

Usually, serious compilers protect you against this kind of thing, but not all will. Keep '#define', 'enum' and 'static const' handy; not all compilers will optimize local 'const' variables.

Coincoin
+1  A: 

Basically, unwinding is useful when the cost of the loop structure is a significant portion of the cost of the loop body. The structure of most loops (and just about all loops that can be unrolled) consists of (a) incrementing an integer, (b) comparing it to another integer, and (c) jumping -- two of which are just about the fastest instructions for the CPU. Hence, in almost any loop, the body will outweigh the structure, yielding an insignificant gain. If you have even one function call in your body, the body will be an order of magnitude slower than the structure -- you'd never notice the savings.

Pretty much the only thing that can really benefit from unrolling is something like memcpy(), where the loop body is just moving a byte from one spot to another --- which is why many C & C++ compilers have been automatically inlining and unrolling memcpy for the last decade.
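
To make the structure-versus-body distinction concrete, a sketch of such a trivial loop (the function is made up for the example):

// Trivial copy loop: the structure (increment, compare, branch) costs roughly
// as much as the one-byte body, so this is one of the few loops where
// unrolling can pay off. With even one function call in the body it would not.
void copy_bytes(char *dst, const char *src, int n)
{
    for (int i = 0; i < n; i++)   // increment + compare + branch
        dst[i] = src[i];          // one load + one store
}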

James Curran
A: 

These optimizations are highly dependent on the CPU the code is executed on and should be done by the compiler, but if you are writing such a compiler you may want to have a look at Section 3.4.1.7 of the Intel document "Intel(R) 64 and IA-32 Architectures Optimization Reference Manual":

  • Unroll small loops until the overhead of the branch and induction variable accounts (generally) for less than 10% of the execution time of the loop.

  • Avoid unrolling loops excessively; this may thrash the trace cache or instruction cache.

  • Unroll loops that are frequently executed and have a predictable number of iterations to reduce the number of iterations to 16 or fewer. Do this unless it increases code size so that the working set no longer fits in the trace or instruction cache. If the loop body contains more than one conditional branch, then unroll so that the number of iterations is 16/(# conditional branches).

You can also order a hard copy for free here.

Thomas Danecker
A: 

Manual loop unwinding is in general useful only for the very most trivial loops.

As a point of reference, the C++ standard library shipped with g++ unrolls exactly two loops in its whole source: the ones implementing the 'find' function with and without a predicate, which look like:

while(first != last && !(*first == val))
  ++first;
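
A sketch of what an unrolled version of such a loop can look like -- this is an illustration of the idea, not the actual libstdc++ code -- processing four elements per pass and finishing the tail with a switch:

// Find val in [first, last): four comparisons per pass, then the tail.
const int *find_int(const int *first, const int *last, int val)
{
    for (long trip = (last - first) / 4; trip > 0; --trip) {
        if (first[0] == val) return first;
        if (first[1] == val) return first + 1;
        if (first[2] == val) return first + 2;
        if (first[3] == val) return first + 3;
        first += 4;
    }
    switch (last - first) {             // at most three elements remain
    case 3: if (*first == val) return first; ++first;  // fall through
    case 2: if (*first == val) return first; ++first;  // fall through
    case 1: if (*first == val) return first; ++first;  // fall through
    default: return last;
    }
}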

I looked at these and other loops, and decided that only for loops this trivial was it worth doing.

Of course, the best answer is to only unroll those loops where your profiler shows it's useful to do so!

Chris Jefferson
A: 

If you've done everything else possible, and this is your remaining hotspot, and there's almost nothing inside the loop, then unrolling makes sense. Those are a lot of "ifs". To verify that it really is your last option, try this.

Mike Dunlavey