
Is it possible to use any loop optimization technique here to reduce the execution time? I need the nested loop with i and j, as I need those combinations of (i, j).

EDIT: even if I leave out the "actual" code and keep this trivial assignment, it takes ~5s on my dual-core box, whereas with the actual code it takes ~6s. I experimented with replacing fn_value+=0 with j+=0, and it takes ~1.73s. What could this be due to?

# include <stdio.h>
# include <time.h>

int main(int argc, char **argv)
{
        float fn_value=0.0;
        int n=10,i,j;
        unsigned int k;
        clock_t start, end;

        start = clock();
        for(k=0;k<9765625;k++)
        {

                for(i=0;i<n;i++)
                {
                        for(j=i;j<n;j++)
// substitute for an "actual" piece of code
                                fn_value+=0; 
                }
        }
        end= clock();

        printf("Time taken %lf\n", (double) (end-start) / CLOCKS_PER_SEC);
        return 0;
}
A: 

The loops themselves are probably irrelevant; it all depends on how much work you do in the innermost loop.

You should do some profiling. It will tell you how much time is spent where, and suggest where optimizations could be made.

unwind
Actually, there are well-known loop optimization techniques. But, yeah, he needs to give us more information.
BobbyShaftoe
Please see the EDIT.
Amit
+1  A: 

Well, it can probably be parallelized pretty well.

Stefano Borini
Well, you don't know that exactly because he didn't specify his "actual" code. :)
BobbyShaftoe
Well, yes, of course you have to assume the "actual piece of code" is not too serial in nature.
Stefano Borini
+1  A: 

You could do loop unrolling. Actually, you could just pass an argument to your compiler to unroll all those loops (the actual argument depends on your compiler).

I don't know what your "actual" code is, so I can't give you more information. One thing: you want to optimize your cache access if you are doing something non-trivial.

Also, are you compiling with optimization? (e.g. -O3 in gcc)

Per your edit:

The reason "j+=0" is faster than "fn_value += 0" is that integer arithmetic is MUCH faster than floating point operations.

This is why we need the actual code to give you informed optimizations.

BobbyShaftoe
Added more information about "actual code".
Amit
No, I am not using any optimizations. I was using -pg for GCC, but even without it the results do not change.
Amit
Well, did you try my suggestions? The -pg option is not going to yield better performance if that's what you think it does.
BobbyShaftoe
I know what -pg does. Yep, so it looks like a floating point arithmetic issue. Thanks for that.
Amit
@Bobby: I avoided giving the "actual" code to abstract out the details that won't be of interest to my fellow SO-ers, and I decided to abstract it out after experimentation. You have given me my answer. So thank you!
Amit
No problem. Glad to help. :)
BobbyShaftoe
You can still push times down by multi-threading and using SIMD instructions, but I guess it deserves a different question.
RaphaelSP
@RaphaelSP, I agree. But yeah, we need the code to determine that. Also, for SIMD, we need more information about the CPU and so forth.
BobbyShaftoe
@RaphaelSP and BobbyShaftoe: Agree. I wasn't looking at the multi-threading/parallel computing options, hence I didn't post the real code.
Amit
Why aren't you using any optimizations? That will make a massive difference.
Andrew Bainbridge
+2  A: 

You may want to check out OpenMP, as you can have doStuff run on different threads for different indices of i, j, and k if doStuff is thread-safe.

Lyndsey Ferguson
That's a good suggestion. The only thing is he hasn't posted enough code for us to know if his actual problem is parallelizable!
BobbyShaftoe
+1  A: 

Loop unrolling does not always do better than the compiler can. As has been said elsewhere, profile and find where the time is going.

I would first focus on the "actual" piece of code. Are there any smarts you can use to "block" up the calculations there? Reuse the previous answer to cheaply calculate the next, etc.?

djna
Please see the EDIT.
Amit
Well, that's actually incorrect. It would be nice if compilers could do everything for us; that is the holy grail of compiler writing. I can show you matrix-multiply code where the naive version runs much slower than the loop-unrolled version, with both versions compiled with optimizations.
BobbyShaftoe
Listened and learned and editing
djna
A: 

It really depends on what "substitute for an actual piece of code" does and how the values i, j and k are used by that code. If i, j and k are all used, then there may not be too much you can actually do (apart from multi-threading, though if used in a mathematical equation, you might be able to use some clever algebra to reduce complexity/repetition of calculations). On the other hand, if none of the values are used, then you could just make it one loop that will execute the specified number of times (though results may vary between compilers/optimization levels).

Basically, you can't optimize the loops themselves if they are the minimum of what you need. Also, this kind of micro-optimization is usually what leads to many bugs and unmaintainable code (even in the games industry, where speed is critical, we always optimize last, and then only the biggest bottlenecks), and you will usually find it's the algorithms, not the code itself, that can be optimized (or replaced with faster algorithms that give a similar result). The example you have given contains no actual algorithm apart from:

fn_value = 0;
k = 9765625;
n = 10;
i = 10;
j = 10;

And as such, the code above is what you could replace your entire loop with, and it would be as optimal as possible (assuming those values are used somewhere else; otherwise you can just eliminate them entirely).

Grant Peters
A: 

I heard once, long ago, that looping down to zero can be quicker on some CPUs...

So:-

for(i=0;i<n;i++)
{
    for(j=i;j<n;j++)
// substitute for an "actual" piece of code
        fn_value+=0; 
}

Becomes (I think, always get arithmetic wrong ;) ):-

for(i=n;i--;)
{
    for(j=n-i;j--;)
// substitute for an "actual" piece of code
        fn_value+=0; 
}

Of course, then your loops are backwards...

I'd love to hear if that makes a difference! My gut instinct is that you're optimizing the wrong thing.

Aha, a link:- http://www.abarnett.demon.co.uk/tutorial.html#FASTFOR

JonB
That might make a small difference if the CPU opcode for comparing to zero is shorter/faster than comparing to a constant.
Loadmaster
Did you read the link? Clearly the difference would be small; we're well into micro-optimisation territory here. There are no big gains to be had optimizing for loops.
JonB
+1  A: 

You are using floating point code! Compilers are rubbish with floating point code.

Here are some measurements I've made. I'm using DevStudio 2005 with default optimisations, and I changed the code slightly:

// added to the inner part of the loop
fn_value += j; 

// added a dependency on fn_value so that the compiler doesn't optimise the
// whole code down to nothing
printf("Time taken %lf - %f", (double) (end-start) / CLOCKS_PER_SEC, fn_value);

So, I get this running in about 5s.

Now, I changed the code a bit:

# include <stdio.h>
# include <time.h>

int main(int argc, char **argv)
{
  int fn_value=0;
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        fn_value+=j; 
    }
  }
  end= clock();

  printf("Time taken %lf - %d", (double) (end-start) / CLOCKS_PER_SEC, fn_value);
  return 0;
}

I changed fn_value to an int. It now takes about a second! So there's a four-second overhead between adding ints and adding floats. I then wrote a version with IA32 FPU opcodes instead of the C code and got about 1.4 seconds, which isn't that much slower than using ints.

Then I used the C floating point version but made fn_value a double, and the time became 1.25s. Now, that surprised me. It beat the FPU opcode version, but, looking at the disassembly, the only difference is that the pure C version unrolled the inner loop.

Also, when using floats, the result is incorrect.

Here's my final test code:

# include <stdio.h>
# include <time.h>

void p1 ()
{
  double fn_value=0;//if this is a float, the answer is slightly wrong
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  __asm fldz;
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        __asm {
          fiadd j
      }
    }
  }
  __asm fstp fn_value;
  end= clock();

  printf("p1: Time taken %lf - %lf\n", (double) (end-start) / CLOCKS_PER_SEC, (double) fn_value);
}

void p2 ()
{
  double fn_value=0;
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        fn_value+=j; 
    }
  }
  end= clock();

  printf("p2: Time taken %lf - %lf\n", (double) (end-start) / CLOCKS_PER_SEC, (double) fn_value);
}

void p3 ()
{
  float fn_value=0;
  int n=10,i,j;
  unsigned int k;
  clock_t start, end;

  start = clock();
  for(k=0;k<9765625;k++)
  {
    for(i=0;i<n;i++)
    {
      for(j=i;j<n;j++)
        fn_value+=j; 
    }
  }
  end= clock();

  printf("p3: Time taken %lf - %lf\n", (double) (end-start) / CLOCKS_PER_SEC, (double) fn_value);
}

int main(int argc, char **argv)
{
  p1 ();
  p2 ();
  p3 ();
  return 0;
}

In summary, double appears to be faster than float. However, we'd need to see the contents of that inner loop to tell whether changing the floating point type will provide any speed-up in your specific case.

UPDATE

The reason the float version is slower than the others is that the float version is constantly writing and reading the value to/from memory. The double and hand-written versions never write the value to RAM. Why does it do this? The main reason I can think of is to reduce the precision of fn_value between operations. Internally, the FPU is 80-bit, whereas a float is 32-bit (in this implementation of C). To keep the values within the range of a float, the compiler converts from 80-bit to 32-bit by writing and reading the value to/from RAM because, as far as I know, there is no FPU instruction that does this to a single FPU register. So, in order to keep the maths '32-bit' (of type float), it introduces a huge overhead. The compiler ignores the difference between the 80-bit FPU and the 64-bit double type and assumes the programmer wants the bigger type as much as possible.

Skizz
Interesting results you mention about "double". However, for me it's still the same amount of time as float. (I am reading the rest of your wonderful answer.)
Amit
+1  A: 

Since your innermost loop has at most 10 iterations, it would improve your speed a little if you could combine the two inner loops into one (a total of 55 iterations, since j starts at i).

Loadmaster
The issue here (from the accepted answer) is the number of FP arithmetic operations I am doing. And sadly, it doesn't look like I can reduce that.
Amit