I came across this blog post:
http://igoro.com/archive/gallery-of-processor-cache-effects/
The "weirdness" in part 7 is what caught my interest.
My first thought was "That's just C# being weird".
It's not. I wrote the following C++ code:
volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 );
memset( (void*)p, 0, sizeof( int ) * 8 );
double dStart = t.GetTime();
for (int i = 0; i < 200000000; i++)
{
    //p[0]++; p[1]++; p[2]++; p[3]++; // Option 1
    //p[0]++; p[2]++; p[4]++; p[6]++; // Option 2
    p[0]++; p[2]++;                   // Option 3
}
double dTime = t.GetTime() - dStart;
The timings I get on my 2.4 GHz Core 2 Quad are as follows:
Option 1 = ~8 cycles per loop.
Option 2 = ~4 cycles per loop.
Option 3 = ~6 cycles per loop.
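For reference, the cycles-per-loop figures above come from converting wall-clock time into cycles. A minimal sketch of that conversion, assuming the timer reports seconds and the core runs at a constant 2.4 GHz (the function name `cycles_per_iteration` is made up for illustration):

```cpp
#include <cassert>

// Convert a measured wall-clock time into approximate CPU cycles per loop
// iteration. clock_hz is assumed to be a constant core frequency, so any
// frequency scaling during the run would skew the result.
double cycles_per_iteration(double elapsed_seconds, double clock_hz, long long iterations)
{
    assert(iterations > 0);
    return elapsed_seconds * clock_hz / static_cast<double>(iterations);
}
```

For example, a run of 200000000 iterations that took 0.5 seconds at 2.4 GHz works out to 6 cycles per loop.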
Now this is confusing. My reasoning behind the difference comes down to the cache write latency (3 cycles) on my chip and an assumption that the cache has a 128-bit write port (this is pure guesswork on my part).
On that basis, in Option 1: it will increment p[0] (1 cycle), then increment p[2] (1 cycle), then it has to wait 1 cycle (for cache), then p[1] (1 cycle), then wait 1 cycle (for cache), then p[3] (1 cycle). Finally, 2 cycles for increment and jump (though it's usually implemented as decrement and jump). This gives a total of 8 cycles.
In Option 2: it can increment p[0] and p[4] in one cycle, then increment p[2] and p[6] in another cycle. Then 2 cycles for subtract and jump. No waits needed on cache. Total 4 cycles.
In Option 3: it can increment p[0], then has to wait 2 cycles, then increment p[2], then subtract and jump, for a total of 6 cycles. The problem is that if you change Option 3 to increment p[0] and p[4], it STILL takes 6 cycles (which kinda blows my 128-bit read/write port theory out of the water).
So ... can anyone tell me what the hell is going on here? Why DOES Option 3 take longer than Option 2? Also, I'd love to know what I've got wrong in my thinking above, as I obviously have something wrong! Any ideas would be much appreciated! :)
It'd also be interesting to see how GCC or any other compiler copes with it as well!
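For anyone who wants to try this with GCC or Clang, here is a rough portable sketch of the Option 3 test. It is not the code I timed above: it swaps `_aligned_malloc` for an `alignas(64)` struct and my timer class for `std::chrono::steady_clock`, and the names `CacheLine` and `time_option3` are made up for illustration.

```cpp
#include <cassert>
#include <chrono>

// Eight ints aligned to a 64-byte boundary, standing in for the
// _aligned_malloc buffer. 64 bytes is an assumed cache line size.
struct alignas(64) CacheLine
{
    volatile int v[8];
};

// Runs the "Option 3" loop body and returns the elapsed time in seconds.
double time_option3(volatile int* p, int iterations)
{
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
    {
        // Same effect as p[0]++; p[2]++; but written out in full because
        // C++20 deprecates ++ on volatile objects.
        p[0] = p[0] + 1;
        p[2] = p[2] + 1;
    }
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Declaring `CacheLine line{};` and calling `time_option3(line.v, 200000000)` should reproduce the Option 3 measurement, though the compiler may lay out the loop differently, so it is worth checking the disassembly before comparing numbers.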
Edit: Jerry Coffin's idea gave me some thoughts.
I've done some more tests (on a different machine, so forgive the change in timings) with and without nops, and with different counts of nops:
case 2 - 0.46 00401ABD jne (401AB0h)
0 nops - 0.68 00401AB7 jne (401AB0h)
1 nop - 0.61 00401AB8 jne (401AB0h)
2 nops - 0.636 00401AB9 jne (401AB0h)
3 nops - 0.632 00401ABA jne (401AB0h)
4 nops - 0.66 00401ABB jne (401AB0h)
5 nops - 0.52 00401ABC jne (401AB0h)
6 nops - 0.46 00401ABD jne (401AB0h)
7 nops - 0.46 00401ABE jne (401AB0h)
8 nops - 0.46 00401ABF jne (401AB0h)
9 nops - 0.55 00401AC0 jne (401AB0h)
I've included the jump statements so you can see that the source and destination are in one cache line. You can also see that we start to get a difference when we are 13 bytes or more apart. Until we hit 16 ... then it all goes wrong.
So Jerry isn't right (though his suggestion DOES help a bit), but something IS going on. I'm more and more intrigued to figure out what it is now. It appears to be more of a memory alignment oddity than an instruction throughput oddity.
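To make the byte distances explicit, here is a small sketch that derives them from the addresses in the listing. The 64-byte cache line size is an assumption about this machine, and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Loop entry address, taken from the jne target in the listing above.
constexpr std::uint32_t kLoopStart = 0x00401AB0;

// Bytes between the loop entry and the first byte of the jne instruction.
constexpr std::uint32_t branch_distance(std::uint32_t jne_address)
{
    return jne_address - kLoopStart;
}

// True when both addresses fall in the same 64-byte block
// (64 bytes is an assumed cache line size).
constexpr bool same_cache_line(std::uint32_t a, std::uint32_t b)
{
    return (a / 64) == (b / 64);
}
```

With these addresses, `branch_distance(0x00401ABD)` is 13 for the 6-nop case and `branch_distance(0x00401AC0)` is 16 for the 9-nop case. Under the 64-byte assumption, 0x00401AC0 also happens to be the first byte of the next line, so the 9-nop jne would no longer share a line with the loop entry.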
Anyone want to explain this for an inquisitive mind? :D
Edit 3: Interjay has a point on the unrolling that blows the previous edit out of the water. With an unrolled loop the performance does not improve. You need to add a nop to make the gap between jump source and destination the same as for my good nop count above, and even then performance still sucks. It's interesting that I need 6 nops to improve performance, though. I wonder how many nops the processor can issue per cycle? If it's 3, then that accounts for the cache write latency ... But, if that's it, why is the latency occurring?
Curiouser and curiouser ...