I have two ways to program the same functionality.

Method 1:

void doTheWork(int action)
{
    for(int i = 0; i < 1000000000; ++i)
    {
        doAction(action);
    }
}

Method 2:

void doTheWork(int action)
{
    switch(action)
    {
    case 1:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1>();
        }
        break;
    case 2:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<2>();
        }
        break;
    //-----------------------------------------------
    //... (there are 1000000 cases here)
    //-----------------------------------------------
    case 1000000:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1000000>();
        }
        break;
    }
}

Let's assume that the function doAction(int action) and the function template<int Action> doAction() each consist of about 10 lines of code that will get inlined at compile time. Calling doAction(#) is equivalent in functionality to calling doAction<#>(), but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since the compiler can apply some nice optimizations to the code when the argument value is known at compile time.
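
For concreteness, here is a minimal sketch of what such a pair could look like; the body is invented purely to illustrate the kind of optimization meant (here, division by a compile-time constant being folded away):

// Hypothetical illustration, not the real doAction.
#include <cstdint>

// Runtime version: 'action' is only known at run time, so the division
// must execute as an actual divide on every call.
inline std::int64_t doAction(int action)
{
    return 1000000007LL / (action + 1);
}

// Templated version: 'Action' is a compile-time constant, so the compiler
// can fold the whole expression into a single immediate constant.
template <int Action>
inline std::int64_t doAction()
{
    return 1000000007LL / (Action + 1);
}

With the template, a call like doAction<7>() compiles down to loading a precomputed constant, while the runtime version has to divide on every call.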

So my question is: do all the millions of lines of code fill the CPU's L1 cache (and beyond) in the case of the templated function, and thus degrade performance considerably, or are only the lines of doAction<#>() inside the currently running loop cached?

+1  A: 

The L1 instruction cache will only contain instructions that were fetched recently or in anticipation of near-future execution. As such, the second method cannot fill the L1 cache simply because the code exists. Your execution path will cause the cache to load only the template-instantiated version for the loop currently being run. As you move to the next loop, it will generally evict the least recently used (LRU) cache lines and replace them with what you are executing next.

In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.
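
To make that concrete, here is a toy sketch (hypothetical, not the code from the question) contrasting a loop that re-executes one instantiation, whose instructions stay resident in L1, with a loop that cycles through many distinct instantiations and therefore has a much larger instruction working set:

#include <array>
#include <cstddef>
#include <cstdio>
#include <utility>

static volatile long long sink; // defeats dead-code elimination

template <int N>
void work() { sink = sink + N * 3 + 1; } // every N is a separate block of code

// Build a table of pointers to 256 distinct instantiations.
template <std::size_t... Is>
constexpr std::array<void (*)(), sizeof...(Is)>
makeTable(std::index_sequence<Is...>)
{
    return { &work<static_cast<int>(Is)>... };
}

int main()
{
    constexpr auto table = makeTable(std::make_index_sequence<256>{});

    // Hot: the same instantiation every iteration -> stays in L1.
    for (long long i = 0; i < 100000000; ++i)
        table[7]();

    // Colder: a different instantiation each iteration -> the working set
    // is all 256 bodies, and at larger scales it stops fitting in L1.
    for (long long i = 0; i < 100000000; ++i)
        table[i & 255]();

    std::printf("%lld\n", sink);
}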

Amardeep
assuming `doAction<int>` instantiations aren't folded again, the code needs to be fetched from L2/L3/main memory. With the typical memory/execution cycle ratio, even if you perfectly predict which instructions come next, you can't shovel them into the L1 cache fast enough.
peterchen
+2  A: 

It depends on the actual code size - 10 lines of code can be a little or a lot - and, of course, on the actual machine.

However, Method 2 violently violates a decades-old rule of thumb: instructions are cheap, memory access is not.

Scalability limit

Your optimizations are usually linear - you might shave off 10, 20, maybe even 30% of execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.

As soon as your code size significantly exceeds the 2nd/3rd-level cache's size, Method 2 will lose big time, as the following estimate for a high-end consumer system shows:

  • DDR3-1333 with 10667 MB/s peak memory bandwidth,
  • Intel Core i7 Extreme with ~75000 MIPS

gives you 10667 MB / 75000 M = 0.14 bytes per instruction for break-even - anything larger, and main memory can't keep up with the CPU.

Typical x86 instructions are 2..3 bytes long and execute in 1..2 cycles (granted, those aren't necessarily the same instructions, since x86 instructions get split into micro-ops; still...). Typical x64 instruction lengths are even larger.

How much does your cache help?
I found the following figure (from a different source, so it's hard to compare directly): the i7 Nehalem's L2 cache (256K, >200 GB/s bandwidth) could almost keep up with x86 instructions, but probably not with x64.

In addition, your L2 cache will kick in completely only if

  • you have perfect prediction of the next instructions, or you have no first-run penalty and the code fits into the cache completely
  • there's no significant amount of data being processed
  • there's no significant other code in your "inner loop"
  • there's no other thread executing on this core

Given that, you can lose much earlier, especially on a CPU/board with smaller caches.
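
If in doubt, measure. Below is a self-contained miniature of the two methods (four cases instead of a million, with invented doAction bodies) plus a trivial timing harness; one could scale the case count up to probe where the cache wall sits on a given machine:

#include <chrono>
#include <cstdio>

static volatile long long sink; // keeps the loops from being optimized away

void doActionRuntime(int v) { sink = sink + v * v; }               // run-time argument
template <int V> void doActionTemplated() { sink = sink + V * V; } // compile-time argument

void method1(int action)
{
    for (long long i = 0; i < 100000000; ++i)
        doActionRuntime(action);
}

void method2(int action)
{
    switch (action)
    {
    case 1: for (long long i = 0; i < 100000000; ++i) doActionTemplated<1>(); break;
    case 2: for (long long i = 0; i < 100000000; ++i) doActionTemplated<2>(); break;
    case 3: for (long long i = 0; i < 100000000; ++i) doActionTemplated<3>(); break;
    case 4: for (long long i = 0; i < 100000000; ++i) doActionTemplated<4>(); break;
    }
}

template <typename F>
double seconds(F f, int action)
{
    auto t0 = std::chrono::steady_clock::now();
    f(action);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

int main()
{
    std::printf("method 1: %.3f s\n", seconds(method1, 3));
    std::printf("method 2: %.3f s\n", seconds(method2, 3));
}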

peterchen