Anyone know this compiler feature? It seems GCC support that. How does it work? What is the potential gain? In which case it's good? Inner loops?
(this question is specific, not about optimization in general, thanks)
Anyone know this compiler feature? It seems GCC support that. How does it work? What is the potential gain? In which case it's good? Inner loops?
(this question is specific, not about optimization in general, thanks)
The fun thing about optimization is that speed gains are found in the unlikeliest of places.
It's also the reason you need a profiler, rather than guessing where the speed problems are.
I recommend starting with a profiler (gperf
if you're using GCC) and just start poking around the results of running your application through some normal operations.
Jason's advise is right on. The best speedups you are going to get come from 'discovering' that you let an O(n^2) algorithm slip into an inner loop somewhere, or that you can cache certain computations outside of expensive functions.
Compared to the micro-optimizations that PGO can trigger, these are the big winners. Once you've done that level of optimization PGO might be able to help. We never had much luck with it though - the cost of the instrumentation was such that our application become unusably slow (by several orders of magnitude).
I like using Intel VTune as a profiler primarily because it is non-invasive compared to instrumenting profilers which change behaviour too much.
It works by placing extra code to count the number of times each codepath is taken. When you compile a second time the compiler uses the knowledge gained about execution of your program that it could only guess at before. There are a couple things PGO can work toward:
You never really know how much these things can help until you test it.
PGO gives about a 5% speed boost when compiling x264, the project I work on, and we have a built-in system for it (make fprofiled). Its a nice free speed boost in some cases, and probably helps more in applications that, unlike x264, are less made up of handwritten assembly.