I've done some inline ASM coding for SSE before, and it was not too hard, even for someone who doesn't know ASM. But I note that MS also provides intrinsics wrapping many such special instructions.

Is there a particular performance difference, or any other strong reason why one should be used over the other?

To repeat from the title, this is specifically covering intrinsics exposed by VC++ 2008 for unmanaged, native C++.

+4  A: 

There is no inline assembly in Visual C++ for x64. Intrinsics can be used on x64 as well. If you ever want to port your code to x64, you'll have to use intrinsics.
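As a rough illustration of the portability point, here is a minimal sketch of SSE code written with intrinsics only. The function name `add_floats` is hypothetical, chosen just for this example; the intrinsics themselves come from `<xmmintrin.h>`. Because it contains no `__asm` block, the same source should build for both 32-bit and x64 targets.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_add_ps, ...)
#include <cstddef>

// Hypothetical helper for illustration: dst[i] = a[i] + b[i] for n floats.
// Uses unaligned loads/stores so the pointers need no special alignment.
void add_floats(const float* a, const float* b, float* dst, std::size_t n)
{
    std::size_t i = 0;

    // Process four floats per iteration with SSE.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);            // load 4 floats from b
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb)); // add and store 4 results
    }

    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

An inline-assembly version of the same loop would have to be rewritten (or moved to a separate MASM file) before it could be built for x64 with Visual C++.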

Michael
+1. In this case, it's not relevant. But thanks for pointing it out, I didn't know this.
John
But you _can_ use MASM x64 if you want
PhiS
+1  A: 

Intrinsics are identical to their equivalent assembly instructions, and you should use them if possible. The compiler knows how to translate them directly, so there is no performance difference.
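A small sketch of what this looks like in practice, assuming a hypothetical function `dot4` invented for this example. Each intrinsic call corresponds to one SSE operation, but the code never names a register: which XMM registers are used, and any extra moves or spills, are left to the compiler's register allocator.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper for illustration: dot product of two 4-float vectors.
float dot4(const float* a, const float* b)
{
    // Element-wise products: (p0, p1, p2, p3).
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));

    // Swap adjacent pairs: (p1, p0, p3, p2).
    __m128 shuf = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));

    // Pairwise sums: (p0+p1, p0+p1, p2+p3, p2+p3).
    __m128 sums = _mm_add_ps(prod, shuf);

    // Bring the upper pair of sums down into the low lanes.
    shuf = _mm_movehl_ps(shuf, sums);

    // Lowest lane now holds p0+p1+p2+p3.
    sums = _mm_add_ss(sums, shuf);

    float result;
    _mm_store_ss(&result, sums);  // write the lowest lane to memory
    return result;
}
```

Inspecting the generated assembly listing for a function like this is a reasonable way to check how closely a given compiler's output matches what you would have written by hand.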

Paul Betts
Really _identical_? What about moving data into registers and so on, is that covered/relevant?
John
I frequently find that I can write assembly that is ~2x faster than equivalent SSE intrinsics due to compilers botching register allocation and/or instruction scheduling. That said, I write vector code all day, every day. Your mileage may vary.
Stephen Canon
@Stephen: have you tried this comparison with the Intel ICC compiler? It's pretty hard to beat, IMHO, but I'd be interested to know if you've been able to beat it with hand-coded assembler.
Paul R
@Paul R: for relatively simple tasks, ICC is usually competitive with hand-written assembly. For complex algorithms with constant register pressure, I find I can usually beat it by a wide margin (however, ICC doesn't take nearly as long to generate its output as I do).
Stephen Canon
@Stephen: thanks - that's interesting - are you working with x86-64 (i.e. 16 SSE registers) or does your register pressure come mainly from being limited to 8 registers?
Paul R
@Paul R: I see this on both 32- and 64-bit code (though definitely more often in 32-bit code). In cases where there is a substantial amount of algorithmic horizontal data movement via shuffles, I see compilers (ICC included) occasionally produce register-register moves that aren't strictly necessary (where they could be avoided by careful instruction reordering, for example). If you're already sitting on the cusp of using all your registers, this can force a sequence of spills that saturate the load/store units and cause stalls.
Stephen Canon
@Stephen: thanks for the useful observations - prompted by this I'm going to take a look at some of my ICC-generated code to see if there is room for improvement.
Paul R
+2  A: 

In general it's better to use intrinsics - it's more productive for the programmer, and a good compiler (e.g. Intel ICC) will do a decent job of register allocation, instruction scheduling, etc. The Microsoft compiler is not as good in this respect, but it probably still does a reasonable job - you can always switch to ICC later if you need better performance.

Paul R
The productivity argument is the right argument for intrinsics. For most tasks, the resulting code will be good enough that the productivity gains from using intrinsics are far more valuable than the added performance from using assembly. Really, only libraries and small sections that are absolutely performance critical should be written in assembly.
Stephen Canon