tags:

views:

930

answers:

6

I've read the article on wikipedia, on the Duff's Device, and I don't get it. I am really interested, but I've read the explanation there a couple of times and I still don't get it how the Duff's device works.

Can some one attempt at a more detailed explanation?

Thanks

+23  A: 

Hi, the explanation in DDJ is the best that I found on the topic.

This being my AHA moment:

for (i = 0; i < len; ++i) {
HAL_IO_PORT = *pSource++;
}

becomes:

int n = len / 8;
for (i = 0; i < n; ++i) {
HAL_IO_PORT = *pSource++;
HAL_IO_PORT = *pSource++;
HAL_IO_PORT = *pSource++;
HAL_IO_PORT = *pSource++;
HAL_IO_PORT = *pSource++;
HAL_IO_PORT = *pSource++;
HAL_IO_PORT = *pSource++;
HAL_IO_PORT = *pSource++;
}

:) becomes:

int n = (len + 8 - 1) / 8;
switch (len % 8) {
case 0: do { HAL_IO_PORT = *pSource++;
case 7: HAL_IO_PORT = *pSource++;
case 6: HAL_IO_PORT = *pSource++;
case 5: HAL_IO_PORT = *pSource++;
case 4: HAL_IO_PORT = *pSource++;
case 3: HAL_IO_PORT = *pSource++;
case 2: HAL_IO_PORT = *pSource++;
case 1: HAL_IO_PORT = *pSource++;
} while (--n > 0);
}
Ric Tokyo
good post (plus I have to find one good answer from you to upvote ;) 2 down, 13 to go: http://stackoverflow.com/questions/359727#486543 ). Enjoy the nice answer badge.
VonC
@VonC :) feels nice
Ric Tokyo
-1 because I arrived here from [the JLS](http://java.sun.com/docs/books/jls/third_edition/html/statements.html#14.11) and the syntax you're using __is__ the part I don't understand
Lord Torgamus
+1 because downvoting someone because you don't understand C syntax is unfair
Dancrumb
+5  A: 

The point of duffs device is to reduce the number of comparisons done in a tight memcpy implementation.

Suppose you want to copy 'count' bytes from a to b, the straight forward approach is to do the following:

  do {                      
      *a = *b++;            
  } while (--count > 0);

How many times do you need to compare count to see if it's a above 0? 'count' times.

Now, the duff device uses a nasty unintentional side effect of a switch case which allows you to reduce the number of comparisons needed to count / 8.

Now suppose you want to copy 20 bytes using duffs device, how many comparisons would you need? Only 3, since you copy eight bytes at a time except the last first one where you copy just 4.

UPDATED: You don't have to do 8 comparisons/case-in-switch statements, but it's reasonable a trade-off between function size and speed.

Johan Dahlin
Note that duff's device is not limited to 8 duplications in the switch statement.
strager
why can't you just use instead of --count, count = count-8? and use a second loop to deal with the remainder?
hhafez
Hhafez, you can use a second loop to deal with the remainder. But now you've twice as much code to accomplish the same thing with no speed increase.
Rob Kennedy
Johan, you have it backward. The remaining 4 bytes are copied on the *first* iteration of the loop, not the last.
Rob Kennedy
@Rob: Edited to correct that.
Software Monkey
+1  A: 

Though I'm not 100% sure what you're asking for, here goes...

The issue that Duff's device addresses is one of loop unwinding (as you'll no doubt have seen on the Wiki link you posted). What this basically equates to is an optimisation of run-time efficiency, over memory footprint. Duff's device deals with serial copying, rather than just any old problem, but is a classic example of how optimisations can be made by reducing the number of times that a comparison needs to be done in a loop.

As an alternative example, which may make it easier to understand, imagine you have an array of items you wish to loop over, and add 1 to them each time... ordinarily, you might use a for loop, and loop around 100 times. This seems fairly logical and, it is... however, an optimisation can be made by unwinding the loop (obviously not too far... or you may as well just not use the loop).

So a regular for loop:

for(int i = 0; i < 100; i++)
{
    myArray[i] += 1;
}

becomes

for(int i = 0; i < 100; i+10)
{
    myArray[i] += 1;
    myArray[i+1] += 1;
    myArray[i+2] += 1;
    myArray[i+3] += 1;
    myArray[i+4] += 1;
    myArray[i+5] += 1;
    myArray[i+6] += 1;
    myArray[i+7] += 1;
    myArray[i+8] += 1;
    myArray[i+9] += 1;
}

What Duff's device does is implement this idea, in C, but (as you saw on the Wiki) with serial copies. What you're seeing above, with the unwound example, is 10 comparisons compared to 100 in the original - this amounts to a minor, but possibly significant, optimisation.

James Burgess
You're missing the key part. It's not just about loop unwinding. The switch statement jumps into the middle of the loop. That's what makes the device look so confusing. Your loop above always performs a multiple of 10 copies, but Duff's performs any number.
Rob Kennedy
That's true - but I was attempting to simplify the description for the OP. Perhaps I didn't clear that up enough! :)
James Burgess
+14  A: 

There are two key things to Duff's device. First, which I suspect is the easier part to understand, the loop is unrolled. This trades larger code size for more speed by avoiding some of the overhead involved in checking whether the loop is finished and jumping back to the top of the loop. The CPU can run faster when it's executing straight-line code instead of jumping.

The second aspect is the switch statement. It allows the code to jump into the middle of the loop the first time through. The surprising part to most people is that such a thing is allowed. Well, it's allowed. Execution starts at the calculated case label, and then it falls through to each successive assignment statement, just like any other switch statement. After the last case label, execution reaches the bottom of the loop, at which point it jumps back to the top. The top of the loop is inside the switch statement, so the switch is not re-evaluated anymore.

The original loop is unwound eight times, so the number of iterations is divided by eight. If the number of bytes to be copied isn't a multiple of eight, then there are some bytes left over. Most algorithms that copy blocks of bytes at a time will handle the remainder bytes at the end, but Duff's device handles them at the beginning. The function calculates count % 8 for the switch statement to figure what the remainder will be, jumps to the case label for that many bytes, and copies them. Then the loop continues to copy groups of eight bytes.

Rob Kennedy
+16  A: 

There are some good explanations above, but let me give it a try. (This is a lot easier on a whiteboard!) Here's the wikipedia example with some notations.

Let's say you're copying 20 bytes. The flow control of the program for the first pass is:

int count;                        // Set to 20
{
    int n = (count + 7) / 8;      // n is now 3.  (The "while" is going
                                  //              to be run three times.)

    switch (count % 8) {          // The remainder is 4 (20 modulo 8) so 
                                  // jump to the case 4

    case 0:                       // [skipped]
             do {                 // [skipped]
                 *to = *from++;   // [skipped]
    case 7:      *to = *from++;   // [skipped]
    case 6:      *to = *from++;   // [skipped]
    case 5:      *to = *from++;   // [skipped]
    case 4:      *to = *from++;   // Start here.  Copy 1 byte  (total 1)
    case 3:      *to = *from++;   // Copy 1 byte (total 2)
    case 2:      *to = *from++;   // Copy 1 byte (total 3)
    case 1:      *to = *from++;   // Copy 1 byte (total 4)
           } while (--n > 0);     // N = 3 Reduce N by 1, then jump up 
                                  //       to the "do" if it's still
    }                             //        greater than 0 (and it is)
}

Now, start the second pass, we run just the indicated code:

int count;                        // 
{
    int n = (count + 7) / 8;      // 
                                  // 

    switch (count % 8) {          // 
                                  // 

    case 0:                       // 
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 5)
    case 7:      *to = *from++;   // Copy 1 byte (total 6)
    case 6:      *to = *from++;   // Copy 1 byte (total 7)
    case 5:      *to = *from++;   // Copy 1 byte (total 8)
    case 4:      *to = *from++;   // Copy 1 byte (total 9)
    case 3:      *to = *from++;   // Copy 1 byte (total 10)
    case 2:      *to = *from++;   // Copy 1 byte (total 11)
    case 1:      *to = *from++;   // Copy 1 byte (total 12)
           } while (--n > 0);     // N = 2 Reduce N by 1, then jump up 
                                  //       to the "do" if it's still 
    }                             //       greater than 0 (and it is)
}

Now, start the third pass:

int count;                        // 
{
    int n = (count + 7) / 8;      // 
                                  // 

    switch (count % 8) {          // 
                                  // 

    case 0:                       // 
             do {                 // The while jumps to here.
                 *to = *from++;   // Copy 1 byte (total 13)
    case 7:      *to = *from++;   // Copy 1 byte (total 14)
    case 6:      *to = *from++;   // Copy 1 byte (total 15)
    case 5:      *to = *from++;   // Copy 1 byte (total 16)
    case 4:      *to = *from++;   // Copy 1 byte (total 17)
    case 3:      *to = *from++;   // Copy 1 byte (total 18)
    case 2:      *to = *from++;   // Copy 1 byte (total 19)
    case 1:      *to = *from++;   // Copy 1 byte (total 20)
           } while (--n > 0);     // N = 1  Reduce N by 1, then jump up 
                                  //       to the "do" if it's still 
    }                             //       greater than 0 (and it's not, so bail)
}                                 // continue here...

20 bytes are now copied.

Note: The original Duff's Device (shown above) copied to an I/O device at the to address thus, it wasn't necessary to increment the pointer *to. When copying between two memory buffers you'd need to use *to++.

clintp
Thanks for the comments explaining the syntax; +1!
Lord Torgamus
A: 

When I read it for the first time, I autoformatted it to this

void dsend(char* to, char* from, count) {
    int n = (count + 7) / 8;
    switch (count % 8) {
        case 0: do {
                *to = *from++;
                case 7: *to = *from++;
                case 6: *to = *from++;
                case 5: *to = *from++;
                case 4: *to = *from++;
                case 3: *to = *from++;
                case 2: *to = *from++;
                case 1: *to = *from++;
            } while (--n > 0);
    }
}

and I had no idea what was happening.

Maybe not when this question was asked, but now Wikipedia has a very good explanation

The device is valid, legal C by virtue of two attributes in C:

  • Relaxed specification of the switch statement in the language's definition. At the time of the device's invention this was the first edition of The C Programming Language which requires only that the controlled statement of the switch be a syntactically valid (compound) statement within which case labels can appear prefixing any sub-statement. In conjunction with the fact that, in the absence of a break statement, the flow of control will fall-through from a statement controlled by one case label to that controlled by the next, this means that the code specifies a succession of count copies from sequential source addresses to the memory-mapped output port.
  • The ability to legally jump into the middle of a loop in C.
Lazer