While reading a post on StackOverflow (http://stackoverflow.com/questions/1502081/im-trying-to-optimize-this-c-code-using-4-way-loop-unrolling), which is now marked as closed, I came across an answer (comment, in fact) that said the following: "The two inner loops could possibly get a speed boost by using UInt64 and bit shifting"
Here is the code that was in the post:
    char rotate8_descr[] = "rotate8: rotate with 8x8 blocking";
    void rotate8(int dim, pixel *src, pixel *dst)
    {
        int i, j, ii, jj;

        for (ii = 0; ii < dim; ii += 8)
            for (jj = 0; jj < dim; jj += 8)
                for (i = ii; i < ii + 8; i++)
                    for (j = jj; j < jj + 8; j++)
                        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
Could anyone please explain how that would be applied here? I am interested in knowing how to apply bit shifting to this code, or to similar code, and why that would help performance. Also, how could this code be optimized for cache usage?
Assume the code is already double tiled/blocked (an outer tile of 32 with inner tiles of 16) and that loop-invariant code motion has been applied. Would it still benefit from bit shifting and UInt64?
If not, then what other suggestions would work?
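To make the question concrete, here is a minimal sketch of what I mean by the loop-invariant code motion step, shown on the single-level 8x8 version for brevity. The `pixel` struct and the `RIDX` macro are not shown in the original post, so the definitions below are my assumptions (three 16-bit channels, row-major indexing), and `rotate8_licm` is a hypothetical name. It also assumes `dim` is a multiple of 8, as the original loop bounds do.

```c
#include <assert.h>

/* Assumed definitions: the original post does not show them. */
typedef struct { unsigned short red, green, blue; } pixel;
#define RIDX(i, j, n) ((i) * (n) + (j))

/* Reference version, as in the post. */
void rotate8(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;

    for (ii = 0; ii < dim; ii += 8)
        for (jj = 0; jj < dim; jj += 8)
            for (i = ii; i < ii + 8; i++)
                for (j = jj; j < jj + 8; j++)
                    dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Same 8x8 blocking, with the index arithmetic hoisted out of the
   innermost loop (hypothetical name, not from the post). */
void rotate8_licm(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;

    assert(dim % 8 == 0);  /* the blocking assumes full 8x8 tiles */

    for (ii = 0; ii < dim; ii += 8)
        for (jj = 0; jj < dim; jj += 8)
            for (i = ii; i < ii + 8; i++) {
                int si = RIDX(i, jj, dim);            /* src: row i, column jj */
                int di = RIDX(dim - 1 - jj, i, dim);  /* dst: row dim-1-jj, column i */

                for (j = 0; j < 8; j++) {
                    dst[di] = src[si + j];  /* reads walk along one src row */
                    di -= dim;              /* writes step up one dst row each time */
                }
            }
}
```

Written this way, the reads are sequential within a row while the writes are `dim` elements apart, which is the access pattern I would like to improve.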
Thanks!