I am using g++ to compile a C++ program, a scientific simulation.

Currently I am using the -O3 and -funroll-loops flags. I noticed a big difference between -O0, -O1, -O2, and -O3, and almost no difference with -funroll-loops.

Do you have any suggestions or tricks to push the optimization further and get even better performance?

Thanks!

Edit, as suggested in the comments: I am asking here about 'pure' compiling optimization, i.e. whether there are cleverer things to do than just -O3. The compute-intensive part of the code deals with manipulation of blitz::array in huge loops.

Edit2: I actually deal with a lot of fp (double) math.

+2  A: 

It's hard to tell without seeing the code you want to accelerate. Also, seeing the code might let us suggest improvements to the code itself, not just to the build flags.

As general advice, try specifying the -march option to tell GCC which CPU model you are targeting. You can try -fomit-frame-pointer if you make many function calls (especially recursive ones). If you use floating-point math heavily and stay away from corner cases (e.g. NaNs, FP exceptions), you can try -ffast-math. The last one may buy you a huge speedup, but in some cases it can produce wrong results. Analyze your code to ensure it is safe.
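
For reference, a combined invocation with the flags mentioned above might look like this (`sim.cpp` and `sim` are placeholder names; `-march=native` targets the CPU of the machine doing the build):

```
g++ -O3 -funroll-loops -march=native -fomit-frame-pointer -ffast-math sim.cpp -o sim
```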

slacker
Thx. In fact I use a lot of fp (double) math. What does `-ffast-math` optimize?
Cedric H.
@Cedric H.:`-ffast-math` allows the compiler to assume no corner cases arise - no NaNs, no over- or underflows, no FP exceptions, no out-of-domain arguments to standard math functions. It also tells the compiler to forget about setting the `errno` variable after some math functions - if you don't use `errno` (as most FP code doesn't), this would be just a waste of time.
slacker
@Cedric H.:These assumptions not only allow the compiler to omit some runtime sanity checks, but also enable code transformations that would cause wrong results if e.g. an underflow happens.
slacker
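
To illustrate slacker's point, here is a small sketch (the functions are hypothetical, not from the question's code) of code whose behaviour can change under `-ffast-math`:

```cpp
#include <cmath>

// Under -ffast-math (specifically -ffinite-math-only), GCC may fold this
// check to 'false', since the compiler assumes NaNs never occur.
bool is_nan(double x) {
    return std::isnan(x);
}

// Under -ffast-math (via -funsafe-math-optimizations), GCC may rewrite this
// as a * (1.0 / (b * c)), which can round differently and can underflow or
// overflow in places the original expression would not.
double scale(double a, double b, double c) {
    return a / b / c;
}
```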
Thanks. It actually allowed a small but notable gain.
Cedric H.
+3  A: 

Without seeing the code, we can only give you generic advice that applies to a broad range of problems.

  1. Try GCC's profile-guided optimisation. Compile instrumented code with -fprofile-generate, do a few test runs with a realistic workload, then rebuild the final binary using the output of those runs (-fprofile-use). GCC can then make better guesses about which branches are taken and optimise the code accordingly (see the command sketch after this list).
  2. Try to parallelize your code if you can. You mentioned that you have loops over big data items; this may work if the work items are independent and you can partition them. E.g. set up a work queue served by a pool of worker threads, with one thread per CPU, and dispatch work to the queue instead of processing it sequentially; the pool threads then grab items off the queue and process them in parallel (a sketch follows this list).
  3. Look at the size of the data units your code works with and try to fit them into as few L1 cache lines as possible (a line is usually 64 bytes). For example, if you have 66-byte data items and your cache line size is 64 bytes, it may be worth packing the structure, or otherwise squeezing it, to fit in 64 bytes (see the padding sketch below).
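
For item 1, the whole profile-guided cycle might look like this (a sketch; `sim.cpp` and `sim` are placeholder names):

```
g++ -O3 -fprofile-generate sim.cpp -o sim   # instrumented build
./sim                                       # realistic workload; writes .gcda profile data
g++ -O3 -fprofile-use sim.cpp -o sim        # final build, guided by the recorded profile
```

For item 2, here is a minimal worker-pool sketch, assuming a C++11 compiler and independent work items (`process` is a stand-in for whatever your loop body does; with an older g++ you would reach for pthreads or Boost.Thread instead):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-item work; stands in for the body of your big loop.
void process(double& item) {
    item *= 2.0;
}

// One thread per CPU; an atomic counter hands out indices, so no two
// threads ever claim the same work item.
void parallel_process(std::vector<double>& items) {
    std::atomic<std::size_t> next(0);
    unsigned n_threads = std::thread::hardware_concurrency();
    if (n_threads == 0) n_threads = 1;  // hardware_concurrency may be unknown

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n_threads; ++t) {
        pool.emplace_back([&] {
            for (std::size_t i = next++; i < items.size(); i = next++)
                process(items[i]);
        });
    }
    for (auto& th : pool)
        th.join();
}
```

And for item 3, a sketch of the padding effect (the field names are made up):

```cpp
// 24 bytes: each char drags in 7 bytes of alignment padding.
struct Loose {
    char   tag;    // 1 byte + 7 padding
    double value;  // 8 bytes
    char   flag;   // 1 byte + 7 trailing padding
};

// 16 bytes: the same fields, reordered so most padding disappears.
// Four of these now fit in one 64-byte cache line instead of two.
struct Tight {
    double value;  // 8 bytes
    char   tag;    // 1 byte
    char   flag;   // 1 byte + 6 trailing padding
};
```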
Alex B
Thanks. To use `-fprofile`, I compiled and linked with `-fprofile-generate`, ran the executable, and then compiled and linked again with `-fprofile-use`. Is this correct?
Cedric H.
@Cedric, yes, that's correct. Don't forget to give it some work after the first compilation, so you have profile data.
Alex B
+2  A: 

I don't have enough mojo to comment on or edit Alex B's answer, so I will answer instead.

After you turn on profiling and run your application per Alex B's answer, actually examine the profile information for hot spots where your application spends most of its time. If you find any, look at the code to see what you can do to make them less hot.
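
Alex B's `-fprofile-generate` data is aimed at the compiler; for a human-readable hot-spot report, one common route (a sketch; `sim.cpp` and `sim` are placeholder names) is GCC's `-pg` instrumentation together with `gprof`:

```
g++ -O2 -pg sim.cpp -o sim    # build with call-graph instrumentation
./sim                         # run a realistic workload; writes gmon.out
gprof ./sim gmon.out | less   # flat profile and call graph, hottest functions first
```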

Appropriate algorithm replacement will generally outperform any automated optimization by a wide margin.

bbadour
+1, I didn't realize there is a tool to examine the profiling output (`gcov`).
Alex B