ansaurus

Question

Answer 1

+9 A:

It's called Duff's device and you can read about it on wikipedia.

It takes care of one problem with an unrolled loop: there could be a non-integer number of passes needed. One method is to deal with this outside the main loop, but it's more efficient to use Duff's device which uses a very fast jump table and avoids extra looping overhead dealing with the odd number of operations.

In your example, which is a memory copy, please compare to the naive version:

void memcpy(char* dst, char* src, size_t count)
{
   begin:
     if (count-- == 0) return;
     *(dst++) = *(src++);
     goto begin;
}

To copy 15 bytes, this does the following:

test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count, copy, loop, test count

Note how many times the "test count" and "loop" operations must be done.

Using duff's version which you showed, it is much simpler:

jump based on count, copy, copy, copy, copy, copy, copy, copy, test count, loop, copy, copy, copy, copy, copy, copy, copy, copy, test count

which saves over half the steps

Ben Voigt 2010-08-02 05:34:48

Answer 2

+2 A:

Obligatory link to Duff's Device on Wikipedia.

Bob Somers 2010-08-02 05:36:44

Answer 3

+5 A:

It's valid. It's a very old-school loop unroll.

Basically, instead of checking the count for every character that's being copied to determine when to stop, it only has to check ceil(n/8) times.

The register keyword is just a compiler hint to suggest that the compiler try to keep that value in a register instead of shuffling it in and out of main memory.

Obviously stuff like this isn't really necessary anymore (memcpy() is likely to have a very fast implementation on whatever machine you're coding for) but tricks like this used to actually provide pretty decent performance wins.

fastcall 2010-08-02 05:39:37

Using this implementation for `memcpy` is just silly, since the library will optimize by copying multiple bytes at once, instead of many one byte copies, but for other repetitive operations in which looping overhead is a lot of the run time, it can be a big improvement. For example, adding a constant to every element in an array. The addition is very fast, the looping logic actually is the biggest cost, so unrolling speeds things up a lot.

Ben Voigt 2010-08-02 05:44:13

Duff's device might still win for very small memcpy operations that are rarely aligned, for example moving around small segments of strings.

R.. 2010-08-02 05:55:41

+1 for stating it's unnecessary. The problem nowadays is far more "unreadable crappy unmaintainable code" than "slow code" :-)

paxdiablo 2010-08-02 06:03:41

@R: You might use a Duff's device variant with no loop to take care of the preamble and postamble portions, but double or quad-word copies are still going to be preferred for the middle of the copy. Even unaligned double-word copy will be faster than individual byte copies on many processors.

Ben Voigt 2010-08-02 06:06:59

My point was that special-casing the "middle of the copy" (which requires a branch or two) might be a net loss if most of your copies are so small and alignment is so random that there usually is no "middle". Obviously the duff's device approach is not the best for a general-purpose memcpy, but I can still envision rare special applications where it may be best.

R.. 2010-08-03 05:51:48

@paxdiablo: I find that nowadays, "unreadable crappy unmaintainable code" and "slow code" are usually one in the same.

R.. 2010-08-03 10:28:28

ansaurus

tags:

views:

answers:

Is this a valid C code?

related questions