views:

172

answers:

5

Consider the bellow code. This code is supposed to be processing data at a fixed rate, in one second batches, It is part of an overal system and can't take up too much time.

When running over 100 lots of 1 seconds worth of data the program takes 35 seconds (or 35%), executing this function in a loop. The test loop is timed specifically with Ada.RealTime. The data is pregenerated so the majority of the execution time is definatetly in this loop.

How do I improce the code to get the processing time down to a minimum?

The code will be running on an Intel Pentium-M which is a P3 with SSE2.

package FF is new Ada.Numerics.Generic_Elementary_Functions(Float);

N : constant Integer := 820;
type A is array(1 .. N) of Float;
type A3 is array(1 .. 3) of A;

procedure F(state  : in out A3;
            result :    out A3;
            l      : in     A;
            r      : in     A) is
   s : Float;
   t : Float;
begin
   for i in 1 .. N loop
      t := l(i) + r(i);
      t := t / 2.0;
      state(1)(i) := t;
      state(2)(i) := t * 0.25 + state(2)(i) * 0.75;
      state(3)(i) := t * 1.0 /64.0 + state(2)(i) * 63.0 /64.0;
      for r in 1 .. 3 loop
         s := state(r)(i);
         t := FF."**"(s, 6.0) + 14.0;
         if t > MAX then
            t := MAX;
         elsif t < MIN then
            t := MIN;
         end if;
         result(r)(i) := FF.Log(t, 2.0);
      end loop;
   end loop;
end;

psuedocode for testing

create two arrays of 80 random A3 arrays, called ls and rs;
init the state and result A3 array
record the realtime time now, called last
for i in 1 .. 100 loop
   for j in 1 .. 80 loop
      F(state, result, ls(j), rs(j));
   end loop;
end loop;
record the realtime time now, called curr
output the duration between curr and last
+1  A: 

I'll back up Marc C here (he generally knows his stuff). I've used gprof with Gnat before. It can be tough to setup, but it works like a champ. If you like, you can use it to get a % of your runtime used by every line of code above.

I could make some suggestions (like precalculating 63.0/64.0) but a good optimizer should already be doing most of them. You need to figure out what it is not doing in the particularly CPU-consuming lines, and speed up that.

Looking over the code, I'm guessing the profiler will show you that the exponentiation and log operations are chewing up the lion's share of the time. If you could find a way to precalcuate some of that stuff that might help. That's getting ahead of myself though. Profile!

T.E.D.
trashgod
+1  A: 

It might be faster to replace the t := FF."**"(s, 6.0) + 14.0; with t := s ** 6 + 14.0; The floating point exponentiation is probably done with log's and exp's. -- Jonathan

Jonathan Parker
The beauty of open source... You can judge the implementation for yourself at http://www.web-port.net/Christfried.Webers/sources/flex/a-ngelfu__adb.htm.
Marc C
There is a lot of extra safety code in there wrapping a `**`, so +1. However, there's presumably a reason for all that stuff (I'm not a math guy). I still think you should profile before doing anything extreme.
T.E.D.
+1  A: 

First let me try to correct my answer:

That should be FF."**"(s, 6.0), and s ** 6, (not FF."*"(s, 6.0) and s * 6) in my answer. [That's odd .. the editor is still trying to remove *'s from my text.]

I just checked the source code Marc C pointed to .. by gad, it does do s ** 6!

I'll add only that I expect that some improvement would come from doing s**6 yourself, using s2 := s*s, and s_to_the_6 := s2 * s2 * s2; -- Jonathan

@user251182: You can edit your original answer instead of creating a new one. http://stackoverflow.com/faq
trashgod
Well...I fixed it for him. If you bracket code with back-tics the editor knows it is code, and won't muck with your asterixes. If you don't, it thinks you are trying to boldface or italicise something.
T.E.D.
"some improvement would come from doing s**6 yourself". i.e. t := s * s * s * s * s * s + 14.0, caused a 20s speed up to 15s; or dropped the execution time to 42.86%.I have found other ways to speed up the code.
mat_geek
A: 

The following code improvements got the run-time down to 8 seconds.

Then, adding the command line options -O3 and -mtune=pentium-m and -msse2 got that runtime down to 0.8 seconds.

My suspicions are:

  • Log(x, y) is implemented as Log(x) / Log(y) which is expencsive if y is constant. (7sec)
  • Power(x, y) is implemented as a loop; and conditionals and jumps are expensive. (20sec)

procedure "**" may be ...

r := x; for i in 2 .. y loop; r := r * x; end loop; return r;

Revised Function

package FF is new Ada.Numerics.Generic_Elementary_Functions(Float); 

N : constant Integer := 820; 
type A is array(1 .. N) of Float; 
type A3 is array(1 .. 3) of A; 

procedure F(state  : in out A3; 
            result :    out A3; 
            l      : in     A; 
            r      : in     A) is 
   -- Keep the Log of 2 so it is not recalculated
   ONE_ON_LOG_TWO : constant Float := 1 / FF.Log(2.0); 
   M1 : constant Float := 1.0 / 64.0;
   M2 : constant Float := 63.0 / 64.0;
   s : Float; 
   t : Float; 
begin 
   for i in 1 .. N loop 
      t := l(i) + r(i); 
      -- Multiply Not Divide
      t := t * 0.5; 
      state(1)(i) := t; 
      state(2)(i) := t * 0.25 + state(2)(i) * 0.75; 
      state(3)(i) := t * M1 + state(2)(i) * M2; 
      for r in 1 .. 3 loop 
         s := state(r)(i);
         -- Since we know the power hared code the multiply.
         t := s * s * s * s * s * s + 14.0; 
         if t > MAX then 
            t := MAX; 
         elsif t < MIN then 
            t := MIN; 
         end if; 
         -- Don't use Log(x,y) in a loop when y is constant. '
         result(r)(i) := FF.Log(t) * ONE_ON_LOG_TWO; 
      end loop; 
   end loop; 
end; 
mat_geek
A: 

You might want to look into making

type A3 is array(1 .. N, 1 .. 3) of Float;

That way, each operation in the outer loop will access neighbouring memory locations and you should get much better support from the cache:

  state(i)(1) := t; 
  state(i)(2) := t * 0.25 + state(i)(2) * 0.75; 
  state(i)(3) := t * M1 + state(i)(2) * M2; 

Using a rename as

  cur_state : array (1..3) of Float renames state(i);

and subsequently

  cur_state := (t, t * 0.25 + cur_state(2) * 0.75, t * M1 + cur_state(2) * M2)

might give the compiler some more hints for optimisations.

Christfried Webers