views:

90

answers:

4

I'm learning about loop unrolling to avoid stalls caused by dependencies. I found many examples on the internet and in the literature, but I found no explanation of how the algorithm used to obtain the optimized code works (in case there is such an algorithm, of course). In particular, I don't know how to determine how many times the loop should be unrolled. Can it be calculated beforehand?

+1  A: 

Are you writing a compiler? Otherwise you really shouldn't do the loop unwinding yourself. You should most likely trust the compiler to do proper loop unwinding for you, where it finds it applicable.

aioobe
Not *strictly* true - there are cases where manual loop unrolling can produce better results than the compiler - but yes, in general, it's better to let the compiler do it. For some newer micro-architectures it's better still to let the hardware do the loop unrolling.
Paul R
that's a valid point, I agree.
aioobe
I'm not unwinding it myself. I'm writing some C code for a microcontroller, and while checking the machine code output from my compiler, I found that the loops had been unrolled. I'm just wondering how it works. I'm not planning on doing it by hand.
Johan B.
A: 

The rule of thumb is that you unwind so that:

  • operations are done on "natural" boundaries of 4, 8, 16, 32, ... bytes
  • you don't introduce excessive register pressure (i.e. you don't start juggling registers to memory)
  • you don't start repeating the exact same instruction sequence over and over

Basically you unroll as long as you can put more resources to work and you stop when you no longer can measure any performance gains.
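To illustrate the "natural boundaries" point, here is a hypothetical sketch in C (the function name and the assumption that the length is a multiple of 4 are mine, not from the answer): a byte-summing loop unrolled by 4 so each iteration consumes a 4-byte-aligned chunk of work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum an array of bytes, unrolled by a factor
 * of 4 so each iteration works on a "natural" 4-byte chunk.
 * Assumes n is a multiple of 4 for brevity; a real version would
 * also handle the remaining 0-3 tail bytes after the loop. */
uint32_t sum_bytes_unrolled(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i += 4) {
        s += p[i];
        s += p[i + 1];
        s += p[i + 2];
        s += p[i + 3];
    }
    return s;
}
```

Note the trade-off the list mentions: a higher unroll factor needs more live values, so at some point you run out of registers and the compiler starts spilling to memory, which eats the gains.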

Torbjörn Gyllebring
A: 

I would like to add my reply because, while Torbjörn Gyllebring's answer is good, it isn't complete, IMO.

There are different improvements due to unrolling:

Algorithmic improvement - e.g. your instruction set allows you to process four bytes at a time rather than a single byte. Torbjörn answered that perfectly.

Reduced loop overhead - when the loop body is small, the loop overhead (typically increment + compare + jump) consumes a significant amount of time:

total cost = N * (loop body + loop overhead)

With unrolling once (factor 2), you get

total cost = N/2 * (2 * loop body + loop overhead)  
           = N * loop body + N / 2 * loop overhead

If the loop overhead is small compared to the loop body, unrolling gives you little gain at the cost of increased code size. Example: when the loop body is 10x the loop overhead, unrolling once gives you roughly a 5% improvement at best.
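A minimal sketch of this overhead-reduction case (function names are mine, and I assume an even length for brevity): the same copy loop rolled and unrolled by 2. The unrolled version pays the increment/compare/branch overhead N/2 times instead of N times, while the useful work per element is unchanged.

```c
#include <stddef.h>

/* Plain loop: overhead (i++, i < n, branch) paid N times. */
void copy_rolled(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];               /* loop body */
}

/* Unrolled once (factor 2): overhead paid only N/2 times.
 * Assumes n is even; a real version would handle an odd tail. */
void copy_unrolled2(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
    }
}
```

This matches the cost formula above: the `loop body` term stays at N copies, only the `loop overhead` term is halved.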

Better pairing - on architectures with multiple pipelines (or quasi-pairing, e.g. register renaming and microcode generation), unrolling may give much better pairing opportunities. Again, the gains will be noticeable only when the loop body is small, but a formula can't be given as easily as above.
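One common way unrolling enables pairing is to split a reduction across independent accumulators (a hypothetical sketch; the function name and even-length assumption are mine). With a single sum, every addition depends on the previous one; with two accumulators, a superscalar core can keep two dependency chains in flight at once.

```c
#include <stddef.h>

/* Hypothetical sketch: dot product unrolled by 2 with two independent
 * accumulators. s0 and s1 form separate dependency chains, so the two
 * multiply-adds per iteration can execute in parallel pipelines.
 * Assumes n is even; a real version would handle an odd tail. */
double dot_two_accs(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        s0 += a[i] * b[i];              /* chain 1 */
        s1 += a[i + 1] * b[i + 1];      /* chain 2, independent of s0 */
    }
    return s0 + s1;                     /* combine once, at the end */
}
```

Note that reassociating floating-point sums like this can change rounding slightly, which is why compilers only do it for you under relaxed FP flags.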

Unrolling isn't harmless - keep in mind that unrolling, even in the "good" cases, almost doubles the code size of the loop. This might purge other code or data from your cache. On modern desktop architectures, code size issues are severe enough that the rule of thumb is to optimize for code size globally, and optimize for speed only in local hotspots.

peterchen
+1  A: 

Sometimes it even makes sense not to unroll the loop (on Core 2 and later processors), because they have a "Loop Stream Detector" (Intel calls it the LSD). Just look it up in the Intel Optimization Guide.

If the code fits inside the (very small) queue, then the processor doesn't need to fetch/decode the instructions from the L1 instruction cache, which can improve performance.

Quonux
+1. I encountered this recently when my software pipelined loop was half the speed of the original "stupid" (but very small) code!
Andy J Buchanan