views: 114
answers: 5

Is there a way to speed up this 1D convolution? I tried to make the dy pass cache-efficient, but compiling with g++ and -O3 gave worse performance.

I am convolving with [-1, 0, 1] in both directions. This is not homework.

#include<iostream>
#include<cstdlib>
#include<sys/time.h>

void print_matrix( int height, int width, float *matrix){
  for (int j=0; j < height; j++){
    for (int i=0; i < width; i++){
      std::cout << matrix[j * width + i] << ",";
    }
    std::cout << std::endl;
  }
}

void fill_matrix( int height, int width, float *matrix){
  for (int j=0; j < height; j++){
    for (int i=0; i < width; i++){
      matrix[j * width + i] = ((float)rand() / (float)RAND_MAX);
    }
  }
}

#define RESTRICT __restrict__

void dx_matrix( int height, int width, float * RESTRICT in_matrix,  float * RESTRICT out_matrix, float *min, float *max){
  //init min,max
  *min = *max = -1.F * in_matrix[0] + in_matrix[1]; 

    for (int j=0; j < height; j++){
      float* row = in_matrix + j * width;
      for (int i=1; i < width-1; i++){
        float res = -1.F * row[i-1] + row[i+1]; /* -1.F * value + 0.F * value + 1.F * value; */ 
        if (res > *max ) *max = res;
        if (res < *min ) *min = res;
        out_matrix[j * width + i] = res;
      }
    }
}

void dy_matrix( int height, int width, float * RESTRICT in_matrix,  float * RESTRICT out_matrix, float *min, float *max){
  //init min,max
  *min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1]; 

  for (int j=1; j < height-1; j++){
      for (int i=0; i < width; i++){
        float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;
        if (res > *max ) *max = res;
        if (res < *min ) *min = res;
        out_matrix[j * width + i] =  res;
      }
    }
}

double now (void)                                                                                          
{                                                                                                                    
  struct timeval tv;                                                                                               
  gettimeofday(&tv, NULL);                                                                                         
  return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}


int main(int argc, char **argv){

  int width, height;
  float *in_matrix;
  float *out_matrix;

  if(argc < 3){
    std::cout << "usage: " << argv[0] << " width height" << std::endl;
    return -1;
  }

  srand(123);

  width = atoi(argv[1]);
  height = atoi(argv[2]);

  std::cout << "Width:"<< width << " Height:" << height << std::endl;

  if (width < 3){
    std::cout << "Width too short " << std::endl;
    return -1;
  }
  if (height < 3){
    std::cout << "Height too short " << std::endl;
    return -1;
  }

  in_matrix = (float *) malloc( height * width * sizeof(float));
  out_matrix = (float *) malloc( height * width * sizeof(float));

  fill_matrix(height, width, in_matrix);
  //print_matrix(height, width, in_matrix);

  float min, max;

  double a = now();
  dx_matrix(height, width, in_matrix, out_matrix, &min, &max);
  std::cout << "dx min:" << min << " max:" << max << std::endl;

  dy_matrix(height, width, in_matrix, out_matrix, &min, &max);
  double b = now();
  std::cout << "dy min:" << min << " max:" << max << std::endl;
  std::cout << "time: " << b-a << " sec" << std::endl;


  return 0;
}
+1  A: 

Well, the compiler might be taking care of these, but here are a couple of small things:

a) Why are you multiplying by -1.F? Why not just subtract? For instance:

float res = -1.F * row[i-1] + row[i+1];

could just be:

float res = row[i+1] - row[i-1];

b) This:

if (res > *max ) *max = res;
if (res < *min ) *min = res;

can be made into

if (res > *max ) *max = res;
else if (res < *min ) *min = res;

and similarly in other places. If the first is true, the second can't be, so let's not check it.

Addition:

Here's another thing. To minimize your multiplications, change

for (int j=1; j < height-1; j++){
  for (int i=0; i < width; i++){
    float res = -1.F * in_matrix[ (j-1) * width + i] + in_matrix[ (j+1) * width + i] ;

to

int h = 0;   // h tracks (j-1) * width
int width2 = 2 * width;
for (int j=1; j < height-1; j++, h += width){
  for (int i=h; i < h + width; i++){
    float res = in_matrix[i + width2] - in_matrix[i];

and at the end of the loop

    out_matrix[i + width] =  res;
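
Putting those pieces together, the full dy loop (a sketch assembled from the snippets above, keeping the min/max updates from the original code) would read:

int h = 0;                       // h tracks (j-1) * width
int width2 = 2 * width;
for (int j=1; j < height-1; j++, h += width){
  for (int i=h; i < h + width; i++){
    float res = in_matrix[i + width2] - in_matrix[i];
    if (res > *max) *max = res;
    if (res < *min) *min = res;
    out_matrix[i + width] = res;
  }
}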

You can do similar things in other places, but hopefully you get the idea. Also, there is a minor bug,

*min = *max = -1.F * in_matrix[0] + in_matrix[ width + 1 ];

should be just in_matrix[ width ] at the end.

Justin Peel
+1 good catches
fabrizioM
Actually, the "else" could make things slower ! If the compiler is smart, in the absence of 'else' it will generate fcmov instructions, which are reasonably fast, but the 'else' will create a potentially unpredictable slow branch. But that presumes that the compiler is smart enough to use fcmov in the first place, and that has to be checked by reviewing the assembly listing.
One could also get rid of "in_matrix[i+width2]" by keeping a second index for i+width2 and incrementing it along with i in the loop (remember to reset it each row). "inc/++" is very cheap.
Pasi Savolainen
+1  A: 

First of all, I would rewrite the dy loop to get rid of "[ (j-1) * width + i]" and "in_matrix[ (j+1) * width + i]", and do something like:

float *p, *q, *out;
p = &in_matrix[(j-1)*width];
q = &in_matrix[(j+1)*width];
out = &out_matrix[j*width];
for (int i=0; i < width; i++){
  float res = -1.F * p[i] + q[i];
  if (res > *max) *max = res;
  if (res < *min) *min = res;
  out[i] = res;
}

But that is a trivial optimization that the compiler may already be doing for you.

It will be slightly faster to do "q[i]-p[i]" instead of "-1.f*p[i]+q[i]", but, again, the compiler may be smart enough to do that behind your back.

The whole thing would benefit considerably from SSE2 and multithreading. I'd bet on at least a 3x speedup from SSE2 right away. Multithreading can be added using OpenMP and it will only take a few lines of code.
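
To give a sense of what an SSE2 version might look like, here is a sketch of the dy pass (the name dy_matrix_sse2 is just illustrative; it assumes height >= 3, which main already enforces). Each iteration subtracts four floats at a time and tracks min/max in vector registers:

#include <algorithm>
#include <emmintrin.h>

void dy_matrix_sse2(int height, int width,
                    const float* in_matrix, float* out_matrix,
                    float* min, float* max)
{
  // Seed the running min/max with the first value the loop computes.
  __m128 vmin = _mm_set1_ps(in_matrix[2 * width] - in_matrix[0]);
  __m128 vmax = vmin;

  for (int j = 1; j < height - 1; j++) {
    const float* p = in_matrix + (j - 1) * width;  // row above
    const float* q = in_matrix + (j + 1) * width;  // row below
    float* o = out_matrix + j * width;

    int i = 0;
    for (; i + 4 <= width; i += 4) {               // four floats per step
      __m128 res = _mm_sub_ps(_mm_loadu_ps(q + i), _mm_loadu_ps(p + i));
      vmin = _mm_min_ps(vmin, res);
      vmax = _mm_max_ps(vmax, res);
      _mm_storeu_ps(o + i, res);
    }
    for (; i < width; i++) {                       // scalar tail
      float res = q[i] - p[i];
      o[i] = res;
      vmin = _mm_min_ss(vmin, _mm_set_ss(res));    // fold into lane 0
      vmax = _mm_max_ss(vmax, _mm_set_ss(res));
    }
  }

  // Horizontal reduction of the four vector lanes.
  float t[4];
  _mm_storeu_ps(t, vmin);
  *min = std::min(std::min(t[0], t[1]), std::min(t[2], t[3]));
  _mm_storeu_ps(t, vmax);
  *max = std::max(std::max(t[0], t[1]), std::max(t[2], t[3]));
}

The dx pass can be vectorized the same way, since the loads at i-1 and i+1 are just unaligned loads. Multithreading on top of this means splitting the outer j loop across threads (for example with #pragma omp parallel for), keeping per-thread min/max and combining them at the end.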

+1 same as the other post. I think the compiler optimizes the matrix accesses
fabrizioM
A: 

The compiler might notice this, but you are creating/freeing a lot of variables on the stack as you go in and out of the scope braces {}. Instead of:

for (int j=0; j < height; j++){ 
      float* row = in_matrix + j * width; 
      for (int i=1; i < width-1; i++){ 
        float res = -1.F * row[i-1] + row[i+1];

How about:

int i, j;
float *row;
float res;

for (j=0; j < height; j++){ 
      row = in_matrix + j * width; 
      for (i=1; i < width-1; i++){ 
        res = -1.F * row[i-1] + row[i+1];
No one in particular
No, they will go into registers.
fabrizioM
+1  A: 

Use local variables for computing the min and max. Every time you do this:

if (res > *max ) *max = res;
if (res < *min ) *min = res;

max and min have to get written to memory. Adding restrict on the pointers would help (indicating the writes are independent), but an even better way would be something like

//Setup
float tempMin = ...
float tempMax = ...
...
    // Inner loop
    tempMin = (res < tempMin) ? res : tempMin;
    tempMax = (res > tempMax) ? res : tempMax;
...
// End
*min = tempMin;
*max = tempMax;
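
Filled in concretely, a sketch of dx_matrix along these lines (assuming the question's RESTRICT macro is in scope, and seeding the accumulators with the first value the loop computes) might look like:

void dx_matrix( int height, int width, float * RESTRICT in_matrix, float * RESTRICT out_matrix, float *min, float *max){
  // Seed with the first value computed below (j=0, i=1).
  float tempMin = in_matrix[2] - in_matrix[0];
  float tempMax = tempMin;

  for (int j=0; j < height; j++){
    float* row = in_matrix + j * width;
    float* out = out_matrix + j * width;
    for (int i=1; i < width-1; i++){
      float res = row[i+1] - row[i-1];
      tempMin = (res < tempMin) ? res : tempMin;
      tempMax = (res > tempMax) ? res : tempMax;
      out[i] = res;
    }
  }

  *min = tempMin;   // write back once instead of on every update
  *max = tempMax;
}

Keeping the accumulators in locals also gives the compiler a better chance of turning the ternaries into conditional moves rather than branches, which ties in with the fcmov comment above.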
celion
+1  A: 

Profiling this with -O3 and -O2, using both the clang and g++ compilers on OS X, I found that:

30% of the time was spent filling the initial matrix

  matrix[j * width + i] = ((float)rand() / (float)RAND_MAX) ;

40% of the time was spent in dx_matrix, on the line:

  out_matrix[j * width + i] = row[i+1] -row[i-1];

About 9% of the time was spent in the conditionals in dx_matrix. I moved them into a separate loop to see if that helped, but it didn't change much.

Shark gave the suggestion that this could be improved through the use of SSE instructions.

Interestingly only about 19% of the time was spent in the dy_matrix routine.

This was run on a 10k by 10k matrix (about 1.6 seconds total).

Note that your results may differ with a different compiler, OS, etc.

Michael Anderson
Interesting, how did you get that information? Shark? Steps to reproduce?
fabrizioM
Shark is a profiling tool for OS X. I'm not sure what the direct equivalents under other OSes are, but you might try cachegrind from the valgrind suite if you're on Linux, or gprof. I simply ran "shark -i -1 ./a.out" and then opened the resulting profile.
Michael Anderson