I'm just wondering how well the MSVC++ compiler can optimize code (with code examples), or what it can't optimize and why.

For example, I used the SSE intrinsics with something like this (var is an __m128 value; it was for a frustum-culling test):

if( var.m128_f32[0] > 0.0f && var.m128_f32[1] > 0.0f && var.m128_f32[2] > 0.0f && var.m128_f32[3] > 0.0f ) {
    ...
}

When I looked at the asm output, I saw that it compiled to an ugly, very jumpy version (and I know that CPUs hate tight jumps). I also know that I can optimize it with the SSE4.1 PTEST instruction, but why did the compiler not do it (even though the compiler writers defined the PTEST intrinsic, so they knew the instruction)?
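For reference, a minimal sketch of a branch-free form of the test above. It does not use PTEST; it swaps in an SSE compare followed by MOVMSKPS, which should also be available under the SSE2 target. The helper name `all_lanes_positive` is made up for illustration and does not come from the original code.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_cmpgt_ps, _mm_movemask_ps

// Hypothetical helper (name is an assumption, not from the question):
// returns true only when all four float lanes of v are > 0.0f.
inline bool all_lanes_positive(__m128 v) {
    // Each lane becomes all-ones where v > 0.0f, all-zeros otherwise.
    __m128 gt = _mm_cmpgt_ps(v, _mm_setzero_ps());
    // Collect the top bit of each lane into the low 4 bits of an int.
    return _mm_movemask_ps(gt) == 0xF;
}
```

Used as `if( all_lanes_positive(var) ) { ... }`, this leaves a single comparison on the mask instead of four chained scalar comparisons. Whether it is actually faster in a given culling loop would have to be measured.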

What other optimizations can't it do (so far)?

Does this imply that with today's technology I'm forced to use intrinsics, inline ASM, and linked ASM functions? Will compilers ever find such things (I don't think so)?

Where can I read more about how well the MSVC++ compiler optimizes?

(Edit 1): I used the SSE2 switch and the FP:fast switch.

+3  A: 

You might want to try Intel's ICC compiler - it generates a lot better code than Visual C++, especially for SSE code. You can get a free 30 day evaluation license from intel.com.

Paul R
It's also caught a lot of flak for generating needlessly inefficient code for AMD CPUs.
jalf
@jalf: I guess that's a moot point, since SSE on AMD CPUs is pretty much useless - you probably want to use Intel CPUs if you're doing serious SIMD work.
Paul R
@Paul: most people write software that has to run on multiple CPUs. Also, I'm not really sure what your problem with AMD's SSE performance is. I'm not aware of any significant limitations on AMD CPUs. Care to elaborate?
jalf
@jalf: AMD still has no support for SSSE3, and its SSE implementation is still 64 bits under the hood (like on pre "Core" Intel CPUs - it takes two clocks to perform a 128 bit operation) so there is a severe performance limitation compared to current generation Intel CPUs which have SSSE3 and full 128 bit execution units.
Paul R
+2  A: 

You can enable the asm view of the compiled code and see for yourself what is generated.

Klaim
I did (well, I wrote it this way; PTEST is an asm instruction), but the question was why the compiler didn't use this optimization... maybe because the MSVC++ guys didn't think about such a use/abuse...
Quonux
A: 

Check the presentation at http://lambda-the-ultimate.org/node/3674

Summary: Compilers generally do lots of amazing tricks now, even things that don't seem to be generally related to imperative programming, like tail-call optimization. MSVC++ is not the best, but it still seems pretty good.

liori
+3  A: 

The default for the compiler is to generate code that will run on a 'lowest common denominator' CPU, i.e. one without SSE 4.1 instructions.

You can change that by targeting later CPUs only in the build options.

That said, the MS compiler is traditionally 'not the best' when it comes to SSE optimisation. I'm not even sure if it supports SSE 4 at all. That link gives good credit to GCC for SSE optimisation:

As a side note about GCC’s near perfection in code generation – I was quite surprised seeing it surpass even Intel’s own compiler

Perhaps you need to change compiler!

gbjbaanb
OK, I forgot to mention that I set it to SSE2; maybe there needs to be an SSE4.1 switch ;). And thx for the GCC hint, I'll check it out soon and try to squeeze it out :P
Quonux