views:

348

answers:

12

I have a loop here and I want to make it run faster. I am passing in a large array. I recently heard of Duff's Device can it be applied to this for loop? any ideas?

for (i = 0; i < dim; i++) {
    for (j = 0; j < dim; j++) {
        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
}
+1  A: 

I believe this is a candidate for Duff's Device, depending on what the RIDX() function does. But I hope you don't expect someone to write the code for you... Also, you might want to format your code properly so it's actually readable.

rmeador
Specifically, if RIDX modifies its second or third parameter (which to do in C it would need to be a macro) then there could be trouble. Duff's device is all about omitting most of the continuation checks, so it needs to be possible to predict how the continuation check behaves.
Steve Jessop
+1  A: 

Probably, as long as dim is a power of 2 or you have fast modulus on your target system. Learned something new today. I independently discovered that construct 5 years back and dropped it into our memCopy() routine. Who knew :)

Michael Dorgan
+16  A: 

Please, please don't use Duff's device. A thousand maintenance programmers will thank you. I used to work for a training company where someone thought it funny to introduce the device in the first ten pages of their C programming course. As an instructor it was impossible to deal with, unless (as the guy that that wrote that bit of the course apparently did) you believe in "kewl" coding.

Needless to say, I had the thing expunged from the course, ASAP.

anon
+1  A: 

Duff's Device is simply a technique for loop unrolling. And since any loop can be unrolled, you can use Duff's Device.

R Samuel Klatchko
+1. Specifically, Duff's device is a way of unrolling a loop where you don't know whether or not the number of iterations is a multiple of the length to which you want to unroll the loop.
Steve Jessop
+2  A: 

Were you able to figure this out and get a gain it would be a pittance and would in no way justify the complexity.

You would be better served spending your energies a level up--reconsidering your entire solution. Perhaps rather than copying values you could create a translation array and spend a little more time looking up answers indirectly when you need them (not really a good idea for building images--just trying to give you a different way to look at it).

Or maybe there is some completely different approach--look at your entire problem and try completely throwing away your current approaches and concepts and just see if there is something you haven't considered because you are too tied to this implementation.

Could your graphics card do some of this work?

Rethinking the problem at a high level works a lot more often than you might think.

Edit: Looking at your sample more, it looks like you are taking a block of your image and copying it, pixel for pixel, to another image. If so, there are almost certainly ways to do it getting rid of the macro and copying byte for byte instead, or even using a block move assembly function then tweaking the edges of the result to match.

Or I may have guessed wrong, but chances are that looking at it on a larger scale than pixel for pixel might help you a lot more than unrolling loops.

Bill K
+3  A: 

You should never unroll loops by hand. It would only give you a very platform-specific advantage, if any. All good compilers can unroll loops, but it's not even guaranteed to make the code faster, because it takes up more memory bandwidth to read a longer program from main memory.

If you want the loop to run fast, you should make sure that whatever RIDX computes, dst is accessed sequentially, so you minimize the number of cache misses. Other than that I can't see how you could make the loop faster.

Jørgen Fogh
I disagree about loop unrolling by hand. I have successfully unrolled loops with them being platform specific. All processors hate branch statements. They would rather process sequential data and math operations. Loop unrolling helps to satisfy this request. Also, the index update and check statements are executed less frequently in unrolled loops, speeding up execution, all this without being platform specific.
Thomas Matthews
On the topic of memory bandwidth, fooey; depends on the content of the loop. Most loops are small enough to fit inside a processor's instruction cache, so there is no additional memory bandwidth since the loop is already cached. The memory bandwidth comes into play when the loop contains calls to external functions. Also, there is no performance penalty in clearly stating your intentions of loop unrolling to the compiler (just don't optimize for space).
Thomas Matthews
@Thomas Matthews: You're absolutely right that most loops will fit in the instruction cache. Are you sure that processors really hate branches though? I would assume that a modern processor would be pretty good at predicting branches in loops. I may be wrong though.
Jørgen Fogh
+5  A: 

Why do you want to make it run faster?

Is there an actual performance problem?

If so, have you profiled and found that this is executing often enough, and hence worth optimizing?

If so, you may want to write it in two ways, the straightforward way you have now and with Duff's Device, or any other method you like.

At that point, you test the performance. You may be surprised. Modern optimizers are quite good, and modern CPUs are really complicated, so source-level optimization is often counterproductive. (I once did this in a loop that was taking a whole lot of time, and found that tightening up the loop, even while introducing some indirection, improved performance. Your mileage is almost certainly going to vary.)

Finally, if Duff's Device is indeed faster, you have to decide whether the performance improvement is worth taking this straightforward and optimizable code and substituting a maintenance problem that may not improve performance at all in the next compiler version.

David Thornley
A: 

Pedantically, no. Duff's Device was for writing to a hardware register (thus the target of the copy was always the same address).

You can implement something very much like Duff's Device for a copy like this, but there will be a clarity and maintenance cost. I'd first profile to make sure it's a problem. I'd also look into whether you can simplify the indexing, as that may enable the compiler to do the dirty work of unrolling the loop.

Adrian McCarthy
A: 

If you use it, make sure you measure it to determine that the improvement is both real , significant, and necessary in terms of your performance requirements. I doubt it will be.

For large loops, the remainder dealt with by Duff's device will be an insignificant proportion of the operation, and for small loops where the remainder is significant you will only see a benefit if you have many such loops (themselves in a loop), because small loops by definition don't take that long! Even then the compiler's optimiser is likely to do as well or better without rendering your code unreadable. It is also possible that the application of Duff's device will prevent the optimiser from applying more perhaps effective optimisations, which is why if you use it you need to measure it.

All the time you are likely to save on this (if any) you have probably wasted several times over reading responses to this question.

Clifford
A: 

Duff's device may not be the optimized solution in an unrolled loop.

I had a function that sent a bit to a port, followed by a clock pulse to another port. For each bit, the functions were:

  if (bit == 1)
  {
     write to the set port.
  }
  else
  {
     write to the clear port.
  }
  write high clock bit.
  write low clock bit.

This was put into a Duff's device loop, along with bit shifting and bit count incrementing.

I improved the efficiency of the loop by using nibble values instead of bits (a nibble being 4 bits). The switch statement was based on the nibble value. This allowed 4 bits to be processed without any if statements, improving the flow through instruction cache (pipeline).

There are times when Duff's device may not be the optimal solution; but can be the foundation for a more efficient solution.

Thomas Matthews
+1  A: 

The number of instruction cycles to implement the statement

dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];

will far outweigh the loop overhead, so unrolling the loop will be very little help on a percentage basis.

Mike Dunlavey
A: 

Modern compilers already do loop unrolling for you when optimizations are turned on, which renders Duff's device obsolete. The compiler knows better than you do the optimal level of unrolling for your compilation target, and you don't have to write any extra code to do it. It was a neat hack at the time, but these days Duff's device is just a historical curiosity, not a good programming practice.

Chris