Manually unrolling loops may be inefficient on newer processors, but it can still be useful on GPUs and lightweight architectures such as ARM, because they are not as good at branch prediction as current-generation desktop CPUs, and because the loop's tests and jumps actually cost cycles on those processors.
That said, it should only be done on very tight loops and in blocks, because unrolling significantly bloats code size; this will blow the instruction cache on small devices and you will end up with a much worse problem on your hands.
A word of warning though: unrolling a loop should be the very last resort when optimizing. It perverts your code to a level that makes it unmaintainable, and someone reading it later might snap and threaten you and your family. Knowing that, make it worth it :)
Using macros can greatly help in making the code more readable, and it makes the unroll deliberate.
Example:
for (int i = 0; i < 256; i++)
{
    a += ptr[i] << 8;
    a -= ptr[i - k] << 8;
    // And possibly some more
}
Can unroll to:
#define UNROLL(i) \
    a += ptr[(i)] << 8; \
    a -= ptr[(i) - k] << 8;
for (int i = 0; i < 256; i += 8)
{
    UNROLL(i);
    UNROLL(i + 1);
    UNROLL(i + 2);
    UNROLL(i + 3);
    UNROLL(i + 4);
    UNROLL(i + 5);
    UNROLL(i + 6);
    UNROLL(i + 7);
}
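Note that this only works cleanly because 256 is a multiple of the unroll factor. If the trip count is not, or is only known at run time, you need a cleanup loop for the leftovers. A minimal sketch, assuming a run-time element count n:

int i = 0;
// Main loop: process as many full blocks of 8 as fit.
for (; i + 8 <= n; i += 8)
{
    UNROLL(i);
    UNROLL(i + 1);
    UNROLL(i + 2);
    UNROLL(i + 3);
    UNROLL(i + 4);
    UNROLL(i + 5);
    UNROLL(i + 6);
    UNROLL(i + 7);
}
// Cleanup loop: process the remaining 0-7 elements.
for (; i < n; i++)
{
    UNROLL(i);
}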
On a somewhat related note: if you really want to win on the instruction count, make sure all constants get folded into as few immediates as possible, so that you don't end up with the following assembly:
// Bad
MOV r1, #4
// ...
ADD r2, r2, #1
// ...
ADD r2, r2, #4
Instead of:
// Better
MOV r1, #4
// ...
ADD r2, r2, #5
Usually, serious compilers protect you against this kind of thing, but not all of them will. Keep '#define', 'enum' and 'static const' handy; not all compilers will constant-fold local 'const' variables.
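At the source level, the same idea looks like this. A minimal sketch; accumulate() and the constant names are made up for illustration:

#define SHIFT 8               // textual substitution: always an immediate
enum { STEP = 4 };            // compile-time constant: always an immediate
static const int BIAS = 4;    // folded by any decent compiler

int accumulate(const int *ptr, int n)
{
    const int local_bias = 4; // local const: not every compiler folds this one
    int a = 0;
    for (int i = 0; i < n; i += STEP)
        a += (ptr[i] + BIAS + local_bias) << SHIFT;
    return a;
}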