I have two ways to program the same functionality.

Method 1:

void doTheWork(int action)
{
    for(int i = 0; i < 1000000000; ++i)
    {
        doAction(action);
    }
}

Method 2:

void doTheWork(int action)
{
    switch(action)
    {
    case 1:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1>();
        }
        break;
    case 2:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<2>();
        }
        break;
    //-----------------------------------------------
    //... (there are 1000000 cases here)
    //-----------------------------------------------
    case 1000000:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1000000>();
        }
        break;
    }
}

Let's assume that the function doAction(int action) and the function template<int Action> doAction() each consist of about 10 lines of code that will get inlined at compile time. Calling doAction(#) is equivalent in functionality to calling doAction<#>(), but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since the compiler can apply some nice optimizations to the code when the argument value is known at compile time.
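
For concreteness, here is a minimal sketch of what such a pair could look like; the body is invented purely to illustrate the kind of optimization meant (here, division by a compile-time constant being folded away):

// Hypothetical illustration, not the real doAction.
#include <cstdint>

// Runtime version: 'action' is only known at run time, so the division
// must execute as an actual divide on every call.
inline std::int64_t doAction(int action)
{
    return 1000000007LL / (action + 1);
}

// Templated version: 'Action' is a compile-time constant, so the compiler
// can fold the whole expression into a single immediate constant.
template <int Action>
inline std::int64_t doAction()
{
    return 1000000007LL / (Action + 1);
}

With the template, a call like doAction<7>() compiles down to loading a precomputed constant, while the runtime version has to divide on every call.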

So my question is: do all the millions of lines of code fill the CPU's L1 cache (and beyond) in the case of the templated function, and thus degrade performance considerably, or are only the lines of doAction<#>() inside the currently running loop cached?

+1  A: 

The L1 instruction cache will only contain instructions that were fetched recently or in anticipation of near-future execution. As such, the second method cannot fill the L1 cache simply because the code exists. Your execution path will cause the cache to load only the template-instantiated version for the loop currently being run. As you move to the next loop, it will generally evict the least recently used (LRU) cache lines and replace them with what you are executing next.

In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.
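
To make that concrete, here is a toy sketch (hypothetical, not the code from the question) contrasting a loop that re-executes one instantiation, whose instructions stay resident in L1, with a loop that cycles through many distinct instantiations and therefore has a much larger instruction working set:

#include <array>
#include <cstddef>
#include <cstdio>
#include <utility>

static volatile long long sink; // defeats dead-code elimination

template <int N>
void work() { sink = sink + N * 3 + 1; } // every N is a separate block of code

// Build a table of pointers to 256 distinct instantiations.
template <std::size_t... Is>
constexpr std::array<void (*)(), sizeof...(Is)>
makeTable(std::index_sequence<Is...>)
{
    return { &work<static_cast<int>(Is)>... };
}

int main()
{
    constexpr auto table = makeTable(std::make_index_sequence<256>{});

    // Hot: the same instantiation every iteration -> stays in L1.
    for (long long i = 0; i < 100000000; ++i)
        table[7]();

    // Colder: a different instantiation each iteration -> the working set
    // is all 256 bodies, and at larger scales it stops fitting in L1.
    for (long long i = 0; i < 100000000; ++i)
        table[i & 255]();

    std::printf("%lld\n", sink);
}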

Amardeep
assuming `doAction<int>` instantiations aren't folded again, the code needs to be fetched from L2/L3/main memory. With the typical memory/execution cycle ratio, even if you perfectly predict which instructions come next, you can't shovel them into the L1 cache fast enough.
peterchen
+2  A: 

It depends on the actual code size - 10 lines of code can be a little or a lot - and, of course, on the actual machine.

However, Method 2 violently violates a decades-old rule of thumb: instructions are cheap, memory access is not.

Scalability limit

Your optimizations are usually linear - you might shave off 10, 20, maybe even 30% of execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.

As soon as your code size significantly exceeds the 2nd/3rd-level cache's size, Method 2 will lose big time, as the following estimate for a high-end consumer system shows:

  • DDR3-1333 with 10667 MB/s peak memory bandwidth,
  • Intel Core i7 Extreme with ~75000 MIPS

gives you 10667 MB / 75000 M = 0.14 bytes per instruction for break-even - anything larger, and main memory can't keep up with the CPU.

Typical x86 instructions are 2..3 bytes long and execute in 1..2 cycles (granted, those aren't necessarily the same instructions, since x86 instructions get split into micro-ops; still...). Typical x64 instruction lengths are even larger.

How much does your cache help?
I found the following figure (from a different source, so it's hard to compare directly): the i7 Nehalem's L2 cache (256K, >200 GB/s bandwidth) could almost keep up with x86 instructions, but probably not with x64.

In addition, your L2 cache will kick in completely only if

  • you have perfect prediction of the next instructions, or you have no first-run penalty and the code fits into the cache completely
  • there's no significant amount of data being processed
  • there's no significant other code in your "inner loop"
  • there's no other thread executing on this core

Given that, you can lose much earlier, especially on a CPU/board with smaller caches.
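
If in doubt, measure. Below is a self-contained miniature of the two methods (four cases instead of a million, with invented doAction bodies) plus a trivial timing harness; one could scale the case count up to probe where the cache wall sits on a given machine:

#include <chrono>
#include <cstdio>

static volatile long long sink; // keeps the loops from being optimized away

void doActionRuntime(int v) { sink = sink + v * v; }               // run-time argument
template <int V> void doActionTemplated() { sink = sink + V * V; } // compile-time argument

void method1(int action)
{
    for (long long i = 0; i < 100000000; ++i)
        doActionRuntime(action);
}

void method2(int action)
{
    switch (action)
    {
    case 1: for (long long i = 0; i < 100000000; ++i) doActionTemplated<1>(); break;
    case 2: for (long long i = 0; i < 100000000; ++i) doActionTemplated<2>(); break;
    case 3: for (long long i = 0; i < 100000000; ++i) doActionTemplated<3>(); break;
    case 4: for (long long i = 0; i < 100000000; ++i) doActionTemplated<4>(); break;
    }
}

template <typename F>
double seconds(F f, int action)
{
    auto t0 = std::chrono::steady_clock::now();
    f(action);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    std::printf("method 1: %.3f s\n", seconds(method1, 3));
    std::printf("method 2: %.3f s\n", seconds(method2, 3));
}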

peterchen