views:

807

answers:

3

Has anyone seen any numbers or analysis on whether use of the C / C++ restrict keyword in gcc / g++ actually provides a significant performance boost in practice (and not just in theory)?

I've read various articles recommending or disparaging its use, but I haven't run across any real numbers that demonstrate either side's arguments in practice.

EDIT

I know that restrict is not officially part of C++, but it is supported by some compilers, and I've read a paper by Christer Ericson which strongly recommends its usage.

A: 

I tested this C program. Without restrict it took 12.640 seconds to complete; with restrict, 12.516 seconds. Looks like it can save some time.

raphaelr
That kind of difference is in the measurement noise...
Drew Hall
That difference is almost certainly insignificant. However, you should also declare c as restrict: at the moment, each time c is written to, the compiler may have to assume that *a, *b and *inc might have been changed.
Autopulated
In your example the optimizer can detect that the parameters don't alias. Try disabling inlining and you'll see a bigger difference.
Nils Pipenbrinck
But if you disable inlining, you're artificially crippling the compiler, so you no longer get an accurate picture of how much `restrict` helps on real-world code.
jalf
@raphaelr: It seems like you need to use optimization flags for restrict to be useful. Try either -O3 or -Os.
Robert S. Barnes
A: 

Note that C++ compilers that allow the restrict keyword may still ignore it. That is the case for example here.

Clifford
Actually, if you read down the page you'll notice that while restrict is ignored in C++ because of a potential conflict with user variables of the same name, `__restrict__` is supported for C++.
Robert S. Barnes
@Robert: And ignored. The difference is only that identifiers with a double underscore are reserved for system usage. Thus `__restrict__` should not clash with any user-declared identifiers.
Martin York
@Martin: How do you know it's ignored? It's not completely clear from the documentation - seems like you could read it either way.
Robert S. Barnes
I agree that it is not clear, but it would seem inconsistent to ignore `restrict` but not `__restrict__`. Either way, it remains non-portable, and beneficial in very specific cases. I suggest that if you know it is beneficial in a particular situation, and you need that benefit to achieve success, then use it; otherwise why make the code gratuitously non-portable? I would not use it habitually, but as a last resort and after testing the actual benefit.
Clifford
@Clifford: Of course, but it's like that with pretty much any optimization - test this way and that way and then use what works.
Robert S. Barnes
+6  A: 

The restrict keyword does make a difference.

I've seen improvements of a factor of 2 or more in some situations (image processing). Most of the time the difference is not that large, though: about 10%.

Here is a little example that illustrates the difference. I've written a very basic 4x4 vector * matrix transform as a test. Note that I have to force the function not to be inlined. Otherwise GCC detects that there aren't any aliasing pointers in my benchmark code, and restrict wouldn't make a difference because of the inlining.

I could have moved the transform function to a different file as well.

#include <math.h>

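// Without -DUSE_RESTRICT, define __restrict away so that both
// builds compile exactly the same source.
#ifndef USE_RESTRICT
#define __restrict
#endif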


void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n) __attribute__ ((noinline));

void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n)
{
  int i;

  // simple transform loop.

  // written with aliasing in mind. dest, src and matrix 
  // are potentially aliasing, so the compiler is forced to reload
  // the values of matrix and src for each iteration.

  for (i=0; i<n; i++)
  {
    dest[0] = src[0] * matrix[0] + src[1] * matrix[1] + 
              src[2] * matrix[2] + src[3] * matrix[3];

    dest[1] = src[0] * matrix[4] + src[1] * matrix[5] + 
              src[2] * matrix[6] + src[3] * matrix[7];

    dest[2] = src[0] * matrix[8] + src[1] * matrix[9] + 
              src[2] * matrix[10] + src[3] * matrix[11];

    dest[3] = src[0] * matrix[12] + src[1] * matrix[13] + 
              src[2] * matrix[14] + src[3] * matrix[15];

    src  += 4;
    dest += 4;
  }
}

float srcdata[4*10000];
float dstdata[4*10000];

int main (int argc, char**args)
{
  int i,j;
  float matrix[16];

  // init all source-data, so we don't get NANs  
  for (i=0; i<16; i++)   matrix[i] = 1;
  for (i=0; i<4*10000; i++) srcdata[i] = i;

  // do a bunch of tests for benchmarking. 
  for (j=0; j<10000; j++)
    transform (dstdata, srcdata, matrix, 10000);
}

Results (on my 2 GHz Core Duo):

nils@doofnase:~$ gcc -O3 test.c
nils@doofnase:~$ time ./a.out

real    0m2.517s
user    0m2.516s
sys     0m0.004s

nils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.c
nils@doofnase:~$ time ./a.out

real    0m2.034s
user    0m2.028s
sys     0m0.000s

Roughly 20% faster execution on that system.

To show how much it depends on the architecture, I've run the same code on a Cortex-A8 embedded CPU (I adjusted the loop count a bit because I didn't want to wait that long):

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c
root@beagleboard:~# time ./a.out

real    0m 7.64s
user    0m 7.62s
sys     0m 0.00s

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c 
root@beagleboard:~# time ./a.out

real    0m 7.00s
user    0m 6.98s
sys     0m 0.00s

Here the difference is just 9% (same compiler, by the way).

Nils Pipenbrinck
Nice work. There is an article on the use of restrict on a Cell processor here: http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html that may be relevant to the discussion of architecture-specific benefits.
Clifford
@Nils Pipenbrinck: Why do you have to disable inlining for the function? It seems like an awfully big function for the compiler to automatically inline.
Robert S. Barnes
@Nils Pipenbrinck: By the way Ulrich Drepper posted code for a superoptimized matrix multiply as part of his discussion of optimizing cache and memory usage. It's here: http://lwn.net/Articles/258188/ . His discussion of each step he went through to arrive at that solution is here: http://lwn.net/Articles/255364/ . He was able to reduce the execution time by 90% over a standard MM.
Robert S. Barnes
Robert S. Barnes