I've seen this blog:

http://igoro.com/archive/gallery-of-processor-cache-effects/

The "weirdness" in part 7 is what caught my interest.

My first thought was "That's just C# being weird".

It's not. I wrote the following C++ code to check:

volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 );
memset( (void*)p, 0, sizeof( int ) * 8 );

double dStart   = t.GetTime();

for (int i = 0; i < 200000000; i++)
{
    //p[0]++;p[1]++;p[2]++;p[3]++;  // Option 1
    //p[0]++;p[2]++;p[4]++;p[6]++;  // Option 2
    p[0]++;p[2]++;                  // Option 3
}

double dTime    = t.GetTime() - dStart;
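
(For anyone wanting to reproduce this, here's a self-contained sketch of the harness. The Timer class is just an illustrative stand-in for my t object, built here on QueryPerformanceCounter, and the cycles-per-loop figure comes from the 2.4 GHz clock and the 200,000,000 iterations.)

// Sketch of the full test harness. "Timer" is a stand-in for my own timer class
// (QueryPerformanceCounter is an assumption); build with optimisations, e.g. /O2.
#include <windows.h>
#include <malloc.h>
#include <string.h>
#include <stdio.h>

struct Timer
{
    double m_freq;
    Timer()          { LARGE_INTEGER f; QueryPerformanceFrequency( &f ); m_freq = (double)f.QuadPart; }
    double GetTime() { LARGE_INTEGER c; QueryPerformanceCounter( &c ); return (double)c.QuadPart / m_freq; }
};

int main()
{
    const int    kLoops   = 200000000;
    const double kClockHz = 2.4e9;          // 2.4 GHz Core 2 Quad

    volatile int* p = (volatile int*)_aligned_malloc( sizeof( int ) * 8, 64 );
    memset( (void*)p, 0, sizeof( int ) * 8 );

    Timer  t;
    double dStart = t.GetTime();

    for (int i = 0; i < kLoops; i++)
    {
        p[0]++;p[2]++;                      // swap in Option 1 / Option 2 here
    }

    double dTime = t.GetTime() - dStart;

    // convert seconds to cycles per loop iteration
    printf( "%f s, ~%.1f cycles per loop\n", dTime, dTime * kClockHz / kLoops );

    _aligned_free( (void*)p );
    return 0;
}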

The timings I get on my 2.4 GHz Core 2 Quad are as follows:

Option 1 = ~8 cycles per loop.
Option 2 = ~4 cycles per loop.
Option 3 = ~6 cycles per loop.

Now this is confusing. My reasoning for the difference comes down to the cache write latency (3 cycles) on my chip and an assumption that the cache has a 128-bit write port (this is pure guesswork on my part).

On that basis, in Option 1: it will increment p[0] (1 cycle), then increment p[2] (1 cycle), then it has to wait 1 cycle (for the cache), then p[1] (1 cycle), then wait 1 cycle (for the cache), then p[3] (1 cycle). Finally, 2 cycles for the increment and jump (though it's usually implemented as decrement and jump). This gives a total of 8 cycles.

In Option 2: it can increment p[0] and p[4] in one cycle, then increment p[2] and p[6] in another cycle. Then 2 cycles for the subtract and jump. No waits needed on the cache. Total: 4 cycles.

In Option 3: it can increment p[0], then has to wait 2 cycles, then increment p[2], then subtract and jump. The problem is that if you set case 3 to increment p[0] and p[4] it STILL takes 6 cycles (which kinda blows my 128-bit read/write port theory out of the water).

So ... can anyone tell me what the hell is going on here? Why DOES case 3 take longer? Also, I'd love to know what I've got wrong in my thinking above, as I obviously have something wrong! Any ideas would be much appreciated! :)

It'd also be interesting to see how GCC or any other compiler copes with it!

Edit: Jerry Coffin's idea gave me some thoughts.

I've done some more tests (on a different machine, so forgive the change in timings) with and without nops, and with different counts of nops:

 case 2 - 0.46  00401ABD  jne         (401AB0h)

 0 nops - 0.68  00401AB7  jne         (401AB0h) 
 1 nop  - 0.61  00401AB8  jne         (401AB0h) 
 2 nops - 0.636 00401AB9  jne         (401AB0h) 
 3 nops - 0.632 00401ABA  jne         (401AB0h) 
 4 nops - 0.66  00401ABB  jne         (401AB0h) 
 5 nops - 0.52  00401ABC  jne         (401AB0h) 
 6 nops - 0.46  00401ABD  jne         (401AB0h) 
 7 nops - 0.46  00401ABE  jne         (401AB0h) 
 8 nops - 0.46  00401ABF  jne         (401AB0h)
 9 nops - 0.55  00401AC0  jne         (401AB0h) 

I've included the jump statements so you can see that the source and destination are in one cache line. You can also see that we start to get a difference when the jump and its target are 13 bytes or more apart. Until we hit 16 ... then it all goes wrong.
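
(For reference, the "N nops" rows were produced roughly like the sketch below; this is an illustration rather than my exact code, and it uses MSVC x86 inline asm. The cache-line check on the addresses in the table is just a divide-by-64.)

// Illustrative sketch of the padded Option 3 loop: each extra single-byte nop
// pushes the backward jne one byte further from the loop top (its jump target).
void TestOption3Padded( volatile int* p )
{
    for (int i = 0; i < 200000000; i++)
    {
        p[0]++;p[2]++;

        // 0..9 single-byte nops; 6 of them puts the jne at 00401ABDh (the fast case)
        __asm nop
        __asm nop
        __asm nop
        __asm nop
        __asm nop
        __asm nop
    }
}

// Two code addresses sit in the same (assumed 64-byte) cache line when they agree
// after dividing by 64; e.g. (0x401AB0 >> 6) == (0x401ABD >> 6) is true.
bool SameCacheLine( unsigned int a, unsigned int b )
{
    return ( a >> 6 ) == ( b >> 6 );
}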

So Jerry isn't right (though his suggestion DOES help a bit); however, something IS going on, and I'm more and more intrigued to figure out what it is now. It appears to be some sort of memory-alignment oddity rather than an instruction-throughput oddity.

Anyone want to explain this for an inquisitive mind? :D

Edit 3: Interjay has a point on the unrolling that blows the previous edit out of the water. With an unrolled loop the performance does not improve. You need to add a nop to make the gap between jump source and destination the same as for my good nop count above. Performance still sucks. It's interesting that I need 6 nops to improve performance, though. I wonder how many nops the processor can issue per cycle? If it's 3, then that accounts for the cache write latency ... but, if that's it, why is the latency occurring?
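
(For the curious, the unrolled variant reads roughly like the sketch below; again an illustration of the shape rather than my exact code: three copies of the Option 3 body per iteration, with the counter stepped by 3 to keep the total work the same, plus the nop mentioned above to pad the jne out.)

// Rough sketch of the unrolled test: three copies of the Option 3 body per
// iteration (the volatile pointer stops the compiler merging the increments),
// counter stepped by 3 so the total number of increments stays the same,
// plus a single nop to pad the jne further from the loop top.
void TestOption3Unrolled( volatile int* p )
{
    for (int i = 0; i < 200000000; i += 3)
    {
        p[0]++;p[2]++;
        p[0]++;p[2]++;
        p[0]++;p[2]++;

        __asm nop
    }
}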

Curiouser and curiouser ...

+2  A: 

This doesn't seem to be compiler-related. At first I thought it could be due to compiler tricks such as loop unrolling, but looking at the generated assembly, MSVC 9.0 just generates a straightforward translation of the C++ code.

Option 1:

$LL3@main:
    add DWORD PTR [esi], ecx
    add DWORD PTR [esi+4], ecx
    add DWORD PTR [esi+8], ecx
    add DWORD PTR [esi+12], ecx
    sub eax, ecx
    jne SHORT $LL3@main

Option 2:

$LL3@main:
    add DWORD PTR [esi], ecx
    add DWORD PTR [esi+8], ecx
    add DWORD PTR [esi+16], ecx
    add DWORD PTR [esi+24], ecx
    sub eax, ecx
    jne SHORT $LL3@main

Option 3:

$LL3@main:
    add DWORD PTR [esi], ecx
    add DWORD PTR [esi+8], ecx
    sub eax, ecx
    jne SHORT $LL3@main
interjay
Yeah I came to the same conclusion. Hence my looking at possible oddities of cache usage such as the write-port and bringing cache write latencies into the game.
Goz
+2  A: 

The x86 instruction set is no longer representative of what the CPU actually does. The instructions are translated to an internal machine language; the term "micro-op" was coined back in the 486 days. Throw in register renaming, speculative execution, multiple execution units and their interaction with the cache, and there's just no way to predict how long something should take anymore. Chip manufacturers stopped publishing cycle time predictions a long time ago. Their designs are a trade secret.

Hans Passant
While yes, you are right to an extent, everything in this should be operating out of the cache. This seems to me to be an important optimisation caveat, and however secret their cycle times are, a 50% hit for doing half as much work is a big hit. This is the sort of thing the likes of Intel are usually happy to explain to people, because it makes their chips look good when people write lightning-fast code. I'm sure it must be explained somewhere.
Goz
@nobugz: Both Intel and AMD still document latencies for individual instructions. Of course, there are just a lot of caveats as to how instructions are scheduled and executed in parallel, and especially regarding the memory/cache subsystem.
jalf
@Goz: I suspect that it's not a 50% hit, but rather a 33% speedup for the longer loop. The loop body is so short for case 3 that you're likely running up against a lot of hardware limitations (there have to be a few cycles between jump instructions in order to verify the guess made by the branch predictor). Throw in cache latency and load/store dependencies, and I suspect the speed in case 2 is due to some special optimization kicking in which doesn't generally apply, and which for some reason can't be used for the shorter case 3.
jalf
And I see no reason to believe that Intel would have documented this behavior anywhere. They document behavior that is 1) useful, and 2) reliable. Your code doesn't show that "doing twice as much work will make your code run 50% faster". If that were the case, it'd be worth documenting. Instead, it simply shows that "things start getting unpredictable when loops get short enough for latency to dominate and become a bottleneck". Unless you write a lot of 4-cycle loops, this information simply doesn't matter, so why would Intel bother documenting it?
jalf
Posting of cycle times (and other technical details) depends primarily on competition. In the Pentium 4 era, Intel's CPUs were clearly slower than AMD's -- and you could download every technical detail about Intel CPUs you could ask for, up to and including full motherboard schematics. Now that Intel CPUs are faster, you have to sign away your life before they'll tell you how many pins the chip has (okay, I exaggerate, but you get the idea).
Jerry Coffin
+3  A: 

I strongly suspect what you're seeing is an oddity of branch prediction rather than anything to do with caching. In particular, on quite a few CPUs, branch prediction doesn't work as well (or at all) when both the source and the target of the branch are in the same cache line. Putting enough code inside the loop (even NOPs) to get the source and target into different cache lines will give a substantial improvement in speed.

Jerry Coffin
BP must work to some extent, or he'd see much worse performance than 6 cycles per iteration. But yeah, good point. I suggested some branch predictor issue in another comment too, but I didn't know the "same cache line" limitation. Sounds like a good guess.
jalf
@jalf: If memory serves, the "not working at all" was only in the Pentium MMX (and possibly the original Pentium). On newer processors it works to at least some degree, but still not nearly as well as for longer jumps.
Jerry Coffin
I tried unrolling the loop for option #3 so that it is the same size as options #1 and #2, and the timing remained exactly the same. So this must not be the cause.
interjay
Nice one, Jerry. I added a bunch of "__asm nop;"s around the adds (almost forgot that NOP is a single-byte instruction though ;)). Now I'm getting the same performance! Thanks! :)
Goz
@interjay have you actually checked the resulting assembler? The __asm nop tricks works perfectly for me :)
Goz
Though ... before I get too far ahead of myself: both the source and destination ARE in the same cache line, but by adding 6 nops I move them more than 12 bytes apart. If I only add 5 nops, performance is still bad.
Goz
@Goz: Adding nops helps, but unrolling the loop doesn't. This suggests that it is related to cache performance as you thought rather than branch prediction.
interjay
Well, if you read my update: you'd need to unroll it twice (so it processes 3 at a time) to see the performance increase. Unrolling just once only adds 4 bytes to the loop.
Goz
@Goz: Your update doesn't say anything about unrolling. Anyway, I tried unrolling twice and it didn't help either.
interjay
@Interjay ... OK, I tried it and you are right. That throws a spanner in the works. Check my NEW edit above ;)
Goz