views: 598
answers: 3

I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE. I'd like to copy and duplicate values (from an XMM register, or from memory) to another XMM register. For example, suppose I have some values {1, 2, 3, 4} in memory. I'd like to copy these values such that xmm1 is populated with {1, 1, 1, 1}, xmm2 with {2, 2, 2, 2}, and so on and so forth.

Looking through the Intel reference manuals, I couldn't find an instruction to do this. Do I just need to use a combination of repeated MOVSS and rotates (via PSHUFD?)?

+3  A: 

Move the source to the destination register, then use 'shufps' with the destination register as both source operands, selecting the appropriate mask.

The following example broadcasts the value of XMM2.x to XMM0.xyzw:

MOVAPS XMM0, XMM2
SHUFPS XMM0, XMM0, 0x00
Adisak
+10  A: 

There are two ways:

  1. Use shufps exclusively:

    __m128 first = ...;
    __m128 xxxx = _mm_shuffle_ps(first, first, 0x00); // _MM_SHUFFLE(0, 0, 0, 0)
    __m128 yyyy = _mm_shuffle_ps(first, first, 0x55); // _MM_SHUFFLE(1, 1, 1, 1)
    __m128 zzzz = _mm_shuffle_ps(first, first, 0xAA); // _MM_SHUFFLE(2, 2, 2, 2)
    __m128 wwww = _mm_shuffle_ps(first, first, 0xFF); // _MM_SHUFFLE(3, 3, 3, 3)
    
  2. Let the compiler choose the best way using _mm_set1_ps and _mm_cvtss_f32:

    __m128 first = ...;
    __m128 xxxx = _mm_set1_ps(_mm_cvtss_f32(first));
    

Note that the second method produces horrible code on MSVC, as discussed here, and it only yields the 'xxxx' result, unlike the first option.

I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE

This is highly unportable. Use intrinsics.

LiraNuna
That's a very good point about portability. I hadn't really thought of it since this is mostly a learning exercise for me. Your article also looks very interesting at first glance. I'm looking forward to spending some more time with it.
jbl
The intrinsic method shown in this answer is better than inline asm because intrinsics allow the compiler to perform many optimizations that it cannot apply to inlined asm: register assignment, loop unrolling, instruction interleaving, hoisting invariants out of loops, etc. My answer used ASM because that's what the original question asked for, but if I were going to use the code myself, I would write it with intrinsics for PERFORMANCE _AND_ PORTABILITY.
Adisak
Adisak: what you said is true for anything but MSVC - it handles intrinsics very poorly (see my article). In MSVC, hand-written assembly is better if performance comes before portability and maintainability (rarely). I would just suggest switching compilers, though :).
LiraNuna
Well, at least the potential for optimization is there for Intrinsics. It's sad to hear that MSVC implements them poorly. Hopefully that will get addressed in the near future for VS2010.
Adisak
Well, it doesn't. Same results as VC2008 (for now, at least).
LiraNuna
A: 

If your values are 16 byte aligned in memory:

movdqa    (mem),    %xmm1
pshufd    $0xff,    %xmm1,    %xmm4
pshufd    $0xaa,    %xmm1,    %xmm3
pshufd    $0x55,    %xmm1,    %xmm2
pshufd    $0x00,    %xmm1,    %xmm1

If not, you can do an unaligned load, or four scalar loads. On newer platforms, the unaligned load should be faster; on older platforms the scalar loads may win.

As others have noted, you can also use shufps.

Stephen Canon
Note: `pshufd` is an SSE2 instruction.
LiraNuna
@LiraNuna: I took the questioner's use of "SSE" to mean some unspecified subset of SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, etc. Since essentially all x86 hardware has had SSE2 support for quite some number of years now, it seemed pretty safe to assume that the questioner didn't mean to proscribe it.
Stephen Canon
It's a general note - it wasn't aimed against your answer in any way.
LiraNuna