views: 875

answers: 2

Recently, a coworker pointed out to me that compiling everything into a single file created much more efficient code than compiling separate object files - even with link time optimization turned on. In addition, the total compile time for the project went down significantly. Given that one of the primary reasons for using C++ is code efficiency, this was surprising to me.

Clearly, when the archiver/linker makes a library out of object files, or links them into an executable, even simple optimizations are penalized. In the example below, trivial inlining costs 1.8% in performance when done by the linker instead of the compiler. It seems like compiler technology should be advanced enough to handle fairly common situations like this, but it isn't happening.

Edit:

Here is a simple example using Visual Studio 2008:

#include <cstdlib>
#include <iostream>
#include <boost/timer.hpp>

using namespace std;

int foo(int x);
int foo2(int x) { return x++; }

int main(int argc, char** argv)
{
  boost::timer t;

  t.restart();
  for (int i=0; i<atoi(argv[1]); i++)
    foo (i);
  cout << "time : " << t.elapsed() << endl;

  t.restart();
  for (int i=0; i<atoi(argv[1]); i++)
    foo2 (i);
  cout << "time : " << t.elapsed() << endl;
}

foo.cpp

int foo (int x) { return x++; }

Results of a run: a 1.8% performance hit when using the linked foo instead of the inlined foo2.

$ ./release/testlink.exe  100000000
time : 13.375
time : 13.14

And yes, the linker optimization flags (/LTCG) are on.

+20  A: 

Your coworker is out of date. The technology has been around since 2003 (in the MS C++ compiler): /LTCG. Link-time code generation deals with exactly this problem. From what I know, GCC has this feature on the radar for the next-generation compiler.

LTCG does not only optimize the code, like inlining functions across modules, but actually rearranges code to optimize cache locality and branching for a specific load; see Profile-Guided Optimizations. These options are usually reserved for Release builds, as the build can take hours to finish: it will link an instrumented executable, run a profiling load, and then link again with the profiling results. The link contains details about what exactly is optimized with LTCG:

Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.

Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.

Register Allocation – Optimizing with profile data results in better register allocation.

Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.

Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.

Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.

Conditional Branch Optimization – With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.

Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.

EH Code Separation – The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.

Memory Intrinsics – The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.

Remus Rusanu
Regarding gcc support, see http://gcc.gnu.org/wiki/LinkTimeOptimization
Pavel Minaev
Thanks for the link Pavel
Remus Rusanu
Link time code generation is slower than compiling as one file.
drewster
@drewster: True. The operational overhead of keeping an entire project in one single module will be far worse, though (source control, branching and integration, code ownership, even editing and navigating). A single file can work for a home-based product. LTCG was designed to compile the Windows code base...
Remus Rusanu
@drewster: Also don't forget the whole discussion about embedding profile guidance based on a representative test load into the code generation decision. No compiler can help there with tricks like dead code separation and branch optimization *based on actual test load values*.
Remus Rusanu
@Remus, You don't need to keep all the code in one file. #include solves that problem just fine. The question is why, even on trivial examples like this, is there such a large penalty.
drewster
Do all the above optimizations take place if you compile as a single file? Is there a difference in the performance between LTCG and compiling as a single translation unit?
jalf
@drewster: Fair enough. I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize, as it operates on a language tree, as opposed to the linker, which has to content itself with operating on object output, far less expressive than the code the compiler has seen. Hence less effort is spent on making linker optimizations that *could* match, in theory, the tricks the compiler does.
Remus Rusanu
@jalf: no. Many of the optimizations above rely on the profiling guide test load.
Remus Rusanu
Remus, could you post your final comment as an answer, so I can select it.
drewster
+3  A: 

I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize, as it operates on a language tree, as opposed to the linker, which has to content itself with operating on the object output, far less expressive than the code the compiler has seen. Hence less effort is spent by the linker and compiler development team(s) on making linker optimizations that could match, in theory, the tricks the compiler does.

BTW, I'm sorry I distracted your original question into the LTCG discussion; I now understand your question was a little bit different, more concerned with the link-time vs. compile-time static optimizations possible/available.

Remus Rusanu