tags:

views:

367

answers:

4

What I'm trying to do is take this C code and optimize it using a technique called loop unrolling, but in this case I want to use four-way loop unrolling. Now, I understand the technique and I understand the concept I just don't know how to apply it to this code. Do I have to add in some extra variables? Do I have to have some code after each loop or just at the end of all the loops? This code is 8x8 block code dealing with taking pixels and rotating it 90 degrees counter clock wise. Any help would greatly be appreciated. Thank You.

/* 
 * rotate8 - rotate with 8x8 blocking
 */

char rotate8_descr[] = "rotate8: rotate with 8x8 blocking";

void rotate8(int dim, pixel *src, pixel *dst) 
{

int i, j, ii, jj;

for(ii = 0; ii < dim; ii += 8)
       for(jj = 0; jj < dim; jj += 8)
              for (i = ii; i < ii + 8; i++)   
                  for (j = jj; j < jj + 8; j++)
                      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
+5  A: 

You can replace the inner loop with 8 explicit lines of code

          dst[RIDX(dim-1-jj, i, dim)] = src[RIDX(i, jj, dim)];
          dst[RIDX(dim-1-(jj+1), i, dim)] = src[RIDX(i, (jj+1), dim)];
          ...
          dst[RIDX(dim-1-(jj+7), i, dim)] = src[RIDX(i, (jj+7), dim)];

so you are replacing the loop variable by explicitly writing a line for each value it takes.

Now you can repeat that for the 8 values of the next loop, you'll have 8 x 8 lines of code, and so on.

As anything other than an exercise in understanding, this seems pretty pointless to me, compilers do this kind of stuff really efficiently, they'll optimise where it makes sense. Hand-rolling rarely produces optimal code.

djna
+1 I recommend the OP profile his code and see where the bottlenecks are, then disassemble his code and see how the compiler is handling the bottlenecks.
Chris Lutz
+2  A: 
gcc -funrull-loops

You shouldn't unroll loops yourself unless GCC cannot do it (look at the assembly) and you've proven by using a profiler that you have to speed up this part of the code.

That example code you have looks like a perfect candidate for automatic loop unrolling.

Some other useful flags:

-O3                          // turns on a lot of optimizations (almost all)
-ftree-vectorize -msse2      // vectorizes automatically some loops
Georg
+1 for -ftree-vectorize and -msse2
LiraNuna
+4  A: 

I wanted to say profile it - but then I did so myself. The surprising part is - the inner loop performs fastest with exactly your layout - unrolling it by hand is actually slower.

However - the real catch is the RIDX macro. Switching the memory layout and switching the outer loops has a significant impact.

Here's my fastest version with indentation to show where it differs from your version. The RIDX macro is assumed to be as defined.

#define RIDX(x,y,d) (x+(y)*(d))
typedef unsigned char pixel;
void rotate8(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;
        for(jj = 0; jj < dim; jj += 8)
    for(ii = 0; ii < dim; ii += 8)
              for (i = ii; i < ii + 8; i++)
                  for (j = jj; j < jj + 8; j++)
                      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

... lesson learned: Always profile :-)

phoku
A: 

http://www.relisoft.com/book/lang/pointer/2ptrarr.html

If your compiler is unable to optimize the human readable, maintainable version of the algorithm, and you have to double as a human compiler-- buy a new compiler! Nobody can afford human compilers any more. So, have mercy on yourself and your fellow programmers who will have to look at your code.

ralu