I'm learning about loop unrolling to avoid stalls caused by dependencies. I found many examples on internet and in literature, but I found no explanation on how the algorithm used to obtain the optimized code works (in case there is one such algorithm, of course). In particular, I don't know how to determinate how many times should the loop be unrolled. Can it be calculated beforehand?
Are you writing a compiler? Otherwise you really shouldn't do the loop unwinding yourself. You should most likely trust the compiler to do proper loop unwinding for you, where it finds it applicable.
The rule of thumb is that you unwind so that:
- operations are done on "natural" boundries 4,8,16,32.. bytes
- you don't introduce excessive register pressure (i.e. you don't start juggling registers to memory)
- you don't start repeating the exact same instruction sequence over and over
Basically you unroll as long as you can put more resources to work and you stop when you no longer can measure any performance gains.
I would like to add my reply because, while Torbjörn Gyllebring's answer is good, it isn't complete IMO
There are different improvements due to unrolling:
Algorithmic improvement - e.g. your instruciton set allows to process four rather than a single byte. Torbjörn answered that perfectly.
Reduce loop overhead when the loop body is small, the loop overhead (typically, increment + compare + jump) consumes a significant amount of time.
total cost = N * (loop body + loop overhead)
with unrolling once, you get
total cost = N/2 * (2 * loop body + loop overhead)
= N * loop body + N / 2 * loop overhead
if the loop overhead is small compared to the loop body, unrolling will give you no gains at the cost of increased code size. Example: When loop body is 10x loop overhead, unrolling gives you a 5% improvement at best.
Better pairing - on architectures with multiple pipelines (or quasi-pairing, e.g. register renaming and microcode generation), unrolling may give much better pairing opportunities. Again, they will be notable only when the loop body is small, but a formula can't be given as easy as above.
Unrolling isn't harmless - keep in mind that unrolling even in the "good" cases almost doubles the code size of the loop. This might purge other code or data from your cache. On modern desktop architectures code size issues are sever enough so that the rule of thumb is to optimize for code size globally, and optimize for speed only local hotspots.
Sometimes it makes even sense to not unroll the loop (for Core2 and up processors) because they do have a "Loop stream detector"(they call it LSD). Just look it up in the Intel optimization Guide.
If the Code do fit inside the (very small) queue, then the processor don't need to fetch/decode the instructions from the L1-Instruction-Cache which can give some performance.