I've seen this blog:

http://igoro.com/archive/gallery-of-processor-cache-effects/

The "weirdness" in part 7 is what caught my interest.

My first thought was "That's just C# being weird".

It's not. I wrote the following C++ code to check:

volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 );
memset( (void*)p, 0, sizeof( int ) * 8 );

double dStart   = t.GetTime();

for (int i = 0; i < 200000000; i++)
{
    //p[0]++;p[1]++;p[2]++;p[3]++;  // Option 1
    //p[0]++;p[2]++;p[4]++;p[6]++;  // Option 2
    p[0]++;p[2]++;                  // Option 3
}

double dTime    = t.GetTime() - dStart;
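
(For anyone wanting to reproduce this, here's a self-contained sketch of the harness. The Timer class is just an illustrative stand-in for my t object, built here on QueryPerformanceCounter, and the cycles-per-loop figure comes from the 2.4 GHz clock and the 200,000,000 iterations.)

// Sketch of the full test harness. "Timer" is a stand-in for my own timer class
// (QueryPerformanceCounter is an assumption); build with optimisations, e.g. /O2.
#include <windows.h>
#include <malloc.h>
#include <string.h>
#include <stdio.h>

struct Timer
{
    double m_freq;
    Timer()          { LARGE_INTEGER f; QueryPerformanceFrequency( &f ); m_freq = (double)f.QuadPart; }
    double GetTime() { LARGE_INTEGER c; QueryPerformanceCounter( &c ); return (double)c.QuadPart / m_freq; }
};

int main()
{
    const int    kLoops   = 200000000;
    const double kClockHz = 2.4e9;          // 2.4 GHz Core 2 Quad

    volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 );
    memset( (void*)p, 0, sizeof( int ) * 8 );

    Timer  t;
    double dStart = t.GetTime();

    for (int i = 0; i < kLoops; i++)
    {
        p[0]++;p[2]++;                      // swap in Option 1 / Option 2 here
    }

    double dTime = t.GetTime() - dStart;

    // convert seconds to cycles per loop iteration
    printf( "%f s, ~%.1f cycles per loop\n", dTime, dTime * kClockHz / kLoops );

    _aligned_free( (void*)p );
    return 0;
}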

The timings I get on my 2.4 GHz Core 2 Quad are as follows:

Option 1 = ~8 cycles per loop.
Option 2 = ~4 cycles per loop.
Option 3 = ~6 cycles per loop.

Now this is confusing. My reasoning for the difference comes down to the cache write latency (3 cycles) on my chip and an assumption that the cache has a 128-bit write port (this is pure guesswork on my part).

On that basis, in Option 1: it will increment p[0] (1 cycle), then increment p[2] (1 cycle), then it has to wait 1 cycle (for the cache), then p[1] (1 cycle), then wait 1 cycle (for the cache), then p[3] (1 cycle). Finally, 2 cycles for the increment and jump (though it's usually implemented as decrement and jump). This gives a total of 8 cycles.

In Option 2: it can increment p[0] and p[4] in one cycle, then increment p[2] and p[6] in another cycle. Then 2 cycles for the subtract and jump. No waits needed on the cache. Total: 4 cycles.

In Option 3: it can increment p[0], then has to wait 2 cycles, then increment p[2], then subtract and jump. The problem is that if you set case 3 to increment p[0] and p[4] it STILL takes 6 cycles (which kinda blows my 128-bit read/write port theory out of the water).

So ... can anyone tell me what the hell is going on here? Why DOES case 3 take longer? Also, I'd love to know what I've got wrong in my thinking above, as I obviously have something wrong! Any ideas would be much appreciated! :)

It'd also be interesting to see how GCC or any other compiler copes with it!

Edit: Jerry Coffin's idea gave me some thoughts.

I've done some more tests (on a different machine, so forgive the change in timings) with and without nops, and with different counts of nops:

 case 2 - 0.46  00401ABD  jne         (401AB0h)

 0 nops - 0.68  00401AB7  jne         (401AB0h) 
 1 nop  - 0.61  00401AB8  jne         (401AB0h) 
 2 nops - 0.636 00401AB9  jne         (401AB0h) 
 3 nops - 0.632 00401ABA  jne         (401AB0h) 
 4 nops - 0.66  00401ABB  jne         (401AB0h) 
 5 nops - 0.52  00401ABC  jne         (401AB0h) 
 6 nops - 0.46  00401ABD  jne         (401AB0h) 
 7 nops - 0.46  00401ABE  jne         (401AB0h) 
 8 nops - 0.46  00401ABF  jne         (401AB0h)
 9 nops - 0.55  00401AC0  jne         (401AB0h) 

I've included the jump statements so you can see that the source and destination are in one cache line. You can also see that we start to get a difference when the jump and its target are 13 bytes or more apart. Until we hit 16 ... then it all goes wrong.
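
(For reference, the "N nops" rows were produced roughly like the sketch below; this is an illustration rather than my exact code, and it uses MSVC x86 inline asm. The cache-line check on the addresses in the table is just a divide-by-64.)

// Illustrative sketch of the padded Option 3 loop: each extra single-byte nop
// pushes the backward jne one byte further from the loop top (its jump target).
void TestOption3Padded( volatile int* p )
{
    for (int i = 0; i < 200000000; i++)
    {
        p[0]++;p[2]++;

        // 0..9 single-byte nops; 6 of them puts the jne at 00401ABDh (the fast case)
        __asm nop
        __asm nop
        __asm nop
        __asm nop
        __asm nop
        __asm nop
    }
}

// Two code addresses sit in the same (assumed 64-byte) cache line when they agree
// after dividing by 64; e.g. (0x401AB0 >> 6) == (0x401ABD >> 6) is true.
bool SameCacheLine( unsigned int a, unsigned int b )
{
    return ( a >> 6 ) == ( b >> 6 );
}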

So Jerry isn't right (though his suggestion DOES help a bit); however, something IS going on, and I'm more and more intrigued to figure out what it is now. It appears to be some sort of memory-alignment oddity rather than an instruction-throughput oddity.

Anyone want to explain this for an inquisitive mind? :D

Edit 3: Interjay has a point on the unrolling that blows the previous edit out of the water. With an unrolled loop the performance does not improve. You need to add a nop to make the gap between jump source and destination the same as for my good nop count above. Performance still sucks. It's interesting that I need 6 nops to improve performance, though. I wonder how many nops the processor can issue per cycle? If it's 3, then that accounts for the cache write latency ... but, if that's it, why is the latency occurring?
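
(For the curious, the unrolled variant reads roughly like the sketch below; again an illustration of the shape rather than my exact code: three copies of the Option 3 body per iteration, with the counter stepped by 3 to keep the total work the same, plus the nop mentioned above to pad the jne out.)

// Rough sketch of the unrolled test: three copies of the Option 3 body per
// iteration (the volatile pointer stops the compiler merging the increments),
// counter stepped by 3 so the total number of increments stays the same,
// plus a single nop to pad the jne further from the loop top.
void TestOption3Unrolled( volatile int* p )
{
    for (int i = 0; i < 200000000; i += 3)
    {
        p[0]++;p[2]++;
        p[0]++;p[2]++;
        p[0]++;p[2]++;

        __asm nop
    }
}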

Curiouser and curiouser ...

+2  A: 

This doesn't seem to be compiler-related. At first I thought it could be due to compiler tricks such as loop unrolling, but looking at the generated assembly, MSVC 9.0 just generates a straightforward translation of the C++ code.

Option 1:

$LL3@main:
    add DWORD PTR [esi], ecx
    add DWORD PTR [esi+4], ecx
    add DWORD PTR [esi+8], ecx
    add DWORD PTR [esi+12], ecx
    sub eax, ecx
    jne SHORT $LL3@main

Option 2:

$LL3@main:
    add DWORD PTR [esi], ecx
    add DWORD PTR [esi+8], ecx
    add DWORD PTR [esi+16], ecx
    add DWORD PTR [esi+24], ecx
    sub eax, ecx
    jne SHORT $LL3@main

Option 3:

$LL3@main:
    add DWORD PTR [esi], ecx
    add DWORD PTR [esi+8], ecx
    sub eax, ecx
    jne SHORT $LL3@main
interjay
Yeah I came to the same conclusion. Hence my looking at possible oddities of cache usage such as the write-port and bringing cache write latencies into the game.
Goz
+2  A: 

The x86 instruction set is no longer representative of what the CPU actually does. The instructions are translated to an internal machine language; the term "micro-op" was coined back in the 486 days. Throw in register renaming, speculative execution, multiple execution units and their interaction with the cache, and there's just no way to predict how long something should take anymore. Chip manufacturers stopped publishing cycle time predictions a long time ago. Their designs are a trade secret.

Hans Passant
While yes, you are right to an extent, everything in this should be operating out of the cache. This seems to me to be an important optimisation caveat, and however secret their cycle times are, a 50% hit for doing half as much work is a big hit. This is the sort of thing the likes of Intel are usually happy to explain to people, because it makes their chips look good when people write lightning-fast code. I'm sure it must be explained somewhere.
Goz
@nobugz: Both Intel and AMD still document latencies for individual instructions. Of course, there are just a lot of caveats as to how instructions are scheduled and executed in parallel, and especially regarding the memory/cache subsystem.
jalf
@Goz: I suspect that it's not a 50% hit, but rather a 33% speedup for the longer loop. The loop body is so short for case 3 that you're likely running up against a lot of hardware limitations (there have to be a few cycles between jump instructions in order to verify the guess made by the branch predictor). Throw in cache latency and load/store dependencies, and I suspect the speed in case 2 is due to some special optimization kicking in which doesn't generally apply, and which for some reason can't be used for the shorter case 3.
jalf
And I see no reason to believe that Intel would have documented this behavior anywhere. They document behavior that is 1) useful, and 2) reliable. Your code doesn't show that "doing twice as much work will make your code run 50% faster". If that were the case, it'd be worth documenting. Instead, it simply shows that "things start getting unpredictable when loops get short enough for latency to dominate and become a bottleneck". Unless you write a lot of 4-cycle loops, this information simply doesn't matter, so why would Intel bother documenting it?
jalf
Posting of cycle times (and other technical details) depends primarily on competition. In the Pentium 4 era, Intel's CPUs were clearly slower than AMD's -- and you could download every technical detail about Intel CPUs you could ask for, up to and including full motherboard schematics. Now that Intel CPUs are faster, you have to sign away your life before they'll tell you how many pins the chip has (okay, I exaggerate, but you get the idea).
Jerry Coffin
+3  A: 

I strongly suspect what you're seeing is an oddity of branch prediction rather than anything to do with caching. In particular, on quite a few CPUs, branch prediction doesn't work as well (or at all) when both the source and the target of the branch are in the same cache line. Putting enough code inside the loop (even NOPs) to get the source and target into different cache lines will give a substantial improvement in speed.

Jerry Coffin
BP must work to some extent, or he'd see much worse performance than 6 cycles per iteration. But yeah, good point. I suggested some branch predictor issue in another comment too, but I didn't know the "same cache line" limitation. Sounds like a good guess.
jalf
@jalf: If memory serves, the "not working at all" was only in the Pentium MMX (and possibly the original Pentium). On newer processors it works to at least some degree, but still not nearly as well as for longer jumps.
Jerry Coffin
I tried unrolling the loop for option #3 so that it is the same size as options #1 and #2, and the timing remained exactly the same. So this must not be the cause.
interjay
Nice one, Jerry. I added a bunch of "__asm nop;"s around the adds (almost forgot that NOP is a single-byte instruction though ;)). Now I'm getting the same performance! Thanks! :)
Goz
@interjay have you actually checked the resulting assembler? The __asm nop tricks works perfectly for me :)
Goz
Though ... before I get too far ahead of myself: both the source and destination ARE in the same cache line, but by adding 6 nops I move them more than 12 bytes apart. If I only add 5 nops, performance is still bad.
Goz
@Goz: Adding nops helps, but unrolling the loop doesn't. This suggests that it is related to cache performance as you thought rather than branch prediction.
interjay
Well, if you read my update: you'd need to unroll it twice (so it processes 3 at a time) to see the performance increase. Unrolling just once only adds 4 bytes to the loop.
Goz
@Goz: Your update doesn't say anything about unrolling. Anyway, I tried unrolling twice and it didn't help either.
interjay
@Interjay ... OK, I tried it and you are right. That throws a spanner in the works. Check my NEW edit above ;)
Goz