ansaurus

Question

I'm trying to optimize this c code using 4-way loop unrolling...

Answer 1

+5 A:

You can replace the inner loop with 8 explicit lines of code

          dst[RIDX(dim-1-jj, i, dim)] = src[RIDX(i, jj, dim)];
          dst[RIDX(dim-1-(jj+1), i, dim)] = src[RIDX(i, (jj+1), dim)];
          ...
          dst[RIDX(dim-1-(jj+7), i, dim)] = src[RIDX(i, (jj+7), dim)];

so you are replacing the loop variable by explicitly writing a line for each value it takes.

Now you can repeat that for the 8 values of the next loop, you'll have 8 x 8 lines of code, and so on.

As anything other than an exercise in understanding, this seems pretty pointless to me, compilers do this kind of stuff really efficiently, they'll optimise where it makes sense. Hand-rolling rarely produces optimal code.

djna 2009-10-01 05:52:48

+1 I recommend the OP profile his code and see where the bottlenecks are, then disassemble his code and see how the compiler is handling the bottlenecks.

Chris Lutz 2009-10-01 05:55:10

Answer 2

+2 A:

gcc -funrull-loops

You shouldn't unroll loops yourself unless GCC cannot do it (look at the assembly) and you've proven by using a profiler that you have to speed up this part of the code.

That example code you have looks like a perfect candidate for automatic loop unrolling.

Some other useful flags:

-O3                          // turns on a lot of optimizations (almost all)
-ftree-vectorize -msse2      // vectorizes automatically some loops

Georg 2009-10-01 06:03:47

+1 for -ftree-vectorize and -msse2

LiraNuna 2009-10-02 00:18:36

Answer 3

+4 A:

I wanted to say profile it - but then I did so myself. The surprising part is - the inner loop performs fastest with exactly your layout - unrolling it by hand is actually slower.

However - the real catch is the RIDX macro. Switching the memory layout and switching the outer loops has a significant impact.

Here's my fastest version with indentation to show where it differs from your version. The RIDX macro is assumed to be as defined.

#define RIDX(x,y,d) (x+(y)*(d))
typedef unsigned char pixel;
void rotate8(int dim, pixel *src, pixel *dst)
{
    int i, j, ii, jj;
        for(jj = 0; jj < dim; jj += 8)
    for(ii = 0; ii < dim; ii += 8)
              for (i = ii; i < ii + 8; i++)
                  for (j = jj; j < jj + 8; j++)
                      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}

... lesson learned: Always profile :-)

phoku 2009-10-01 14:37:31

Answer 4

A:

http://www.relisoft.com/book/lang/pointer/2ptrarr.html

If your compiler is unable to optimize the human readable, maintainable version of the algorithm, and you have to double as a human compiler-- buy a new compiler! Nobody can afford human compilers any more. So, have mercy on yourself and your fellow programmers who will have to look at your code.

ralu 2009-10-02 00:12:50

ansaurus

tags:

views:

answers:

I'm trying to optimize this c code using 4-way loop unrolling...

related questions