views: 875

answers: 2

Recently, a coworker pointed out to me that compiling everything into a single file created much more efficient code than compiling separate object files - even with link time optimization turned on. In addition, the total compile time for the project went down significantly. Given that one of the primary reasons for using C++ is code efficiency, this was surprising to me.

Clearly, when the archiver/linker makes a library out of object files, or links them into an executable, even simple optimizations are penalized. In the example below, trivial inlining costs 1.8% in performance when done by the linker instead of the compiler. It seems like compiler technology should be advanced enough to handle fairly common situations like this, but it isn't happening.

Edit:

Here is a simple example using Visual Studio 2008:

#include <cstdlib>
#include <iostream>
#include <boost/timer.hpp>

using namespace std;

int foo(int x);
int foo2(int x) { return x++; }

int main(int argc, char** argv)
{
  boost::timer t;

  t.restart();
  for (int i=0; i<atoi(argv[1]); i++)
    foo (i);
  cout << "time : " << t.elapsed() << endl;

  t.restart();
  for (int i=0; i<atoi(argv[1]); i++)
    foo2 (i);
  cout << "time : " << t.elapsed() << endl;
}

foo.cpp

int foo (int x) { return x++; }

Results of a run: a 1.8% performance hit when using the linked foo instead of the inlined foo2.

$ ./release/testlink.exe  100000000
time : 13.375
time : 13.14

And yes, the linker optimization flags (/LTCG) are on.

+20  A: 

Your coworker is out of date. The technology has been around since 2003 (in the MS C++ compiler): /LTCG. Link-time code generation deals with exactly this problem. From what I know, GCC has this feature on the radar for the next-generation compiler.

LTCG does not only optimize the code, like inlining functions across modules, but actually rearranges code to optimize cache locality and branching for a specific load; see Profile-Guided Optimizations. These options are usually reserved for Release builds, as the build can take hours to finish: it will link an instrumented executable, run a profiling load, and then link again with the profiling results. The link contains details about what exactly is optimized with LTCG:

Inlining – For example, if there exists a function A that frequently calls function B, and function B is relatively small, then profile-guided optimizations will inline function B in function A.

Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.

Register Allocation – Optimizing with profile data results in better register allocation.

Basic Block Optimization – Basic block optimization allows commonly executed basic blocks that temporally execute within a given frame to be placed in the same set of pages (locality). This minimizes the number of pages used, thus minimizing memory overhead.

Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.

Function Layout – Based on the call graph and profiled caller/callee behavior, functions that tend to be along the same execution path are placed in the same section.

Conditional Branch Optimization – With the value probes, profile-guided optimizations can find if a given value in a switch statement is used more often than other values. This value can then be pulled out of the switch statement. The same can be done with if/else instructions where the optimizer can order the if/else so that either the if or else block is placed first depending on which block is more frequently true.

Dead Code Separation – Code that is not called during profiling is moved to a special section that is appended to the end of the set of sections. This effectively keeps this section out of the often-used pages.

EH Code Separation – The EH code, being exceptionally executed, can often be moved to a separate section when profile-guided optimizations can determine that the exceptions occur only on exceptional conditions.

Memory Intrinsics – The expansion of intrinsics can be decided better if it can be determined if an intrinsic is called frequently. An intrinsic can also be optimized based on the block size of moves or copies.

Remus Rusanu
Regarding gcc support, see http://gcc.gnu.org/wiki/LinkTimeOptimization
Pavel Minaev
Thanks for the link Pavel
Remus Rusanu
Link time code generation is slower than compiling as one file.
drewster
@drewster: True. The operational overhead of keeping an entire project in one single module will be far worse, though (source control, branching and integration, code ownership, even editing and navigating). A single file can work for a home-based product. LTCG was designed to compile the Windows code base...
Remus Rusanu
@drewster: Also don't forget the whole discussion about embedding profile guidance based on a representative test load into the code generation decision. No compiler can help there with tricks like dead code separation and branch optimization *based on actual test load values*.
Remus Rusanu
@Remus, You don't need to keep all the code in one file. #include solves that problem just fine. The question is why, even on trivial examples like this, is there such a large penalty.
drewster
Do all the above optimizations take place if you compile as a single file? Is there a difference in the performance between LTCG and compiling as a single translation unit?
jalf
@drewster: Fair enough. I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize, as it operates on a language tree, as opposed to the linker, which has to content itself with operating on object output, far less expressive than the code the compiler has seen. Hence less effort is spent on making linker optimizations that *could* match, in theory, the tricks the compiler does.
Remus Rusanu
@jalf: no. Many of the optimizations above rely on the profiling guide test load.
Remus Rusanu
Remus, could you post your final comment as an answer, so I can select it.
drewster
+3  A: 

I'm not a compiler specialist, but I think the compiler has much more information at its disposal to optimize, as it operates on a language tree, as opposed to the linker, which has to content itself with operating on the object output, far less expressive than the code the compiler has seen. Hence less effort is spent by the linker and compiler development team(s) on making linker optimizations that could match, in theory, the tricks the compiler does.

BTW, I'm sorry I distracted your original question into the LTCG discussion; I now understand your question was a little bit different, more concerned with the link-time vs. compile-time static optimizations possible/available.

Remus Rusanu