Hello,
I'm writing transpose function for 8x16bit vectors with SSE2 intrinsics. Since there are 8 arguments for that function (a matrix of 8x8x16bit size), I can't do anything but pass them by reference. Will that be optimized by the compiler (I mean, will these __m128i objects be passed in registers instead of stack)?
Code snippet:
inline void transpose (__m128i &a0, __m128i &a1, __m128i &a2, __m128i &a3,
__m128i &a4, __m128i &a5, __m128i &a6, __m128i &a7) {
....
}