views: 598
answers: 3

I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE. I'd like to copy and duplicate values (from an XMM register, or from memory) to another XMM register. For example, suppose I have some values {1, 2, 3, 4} in memory. I'd like to copy these values such that xmm1 is populated with {1, 1, 1, 1}, xmm2 with {2, 2, 2, 2}, and so on and so forth.

Looking through the Intel reference manuals, I couldn't find an instruction to do this. Do I just need to use a combination of repeated MOVSS and rotates (via PSHUFD?)?

+3  A: 

Move the source to the destination register, then use 'shufps' with the destination register as both source operands, selecting the appropriate mask.

The following example broadcasts the value of XMM2.x to XMM0.xyzw:

MOVAPS XMM0, XMM2
SHUFPS XMM0, XMM0, 0x00
Adisak
+10  A: 

There are two ways:

  1. Use shufps exclusively:

    __m128 first = ...;
    __m128 xxxx = _mm_shuffle_ps(first, first, 0x00); // _MM_SHUFFLE(0, 0, 0, 0)
    __m128 yyyy = _mm_shuffle_ps(first, first, 0x55); // _MM_SHUFFLE(1, 1, 1, 1)
    __m128 zzzz = _mm_shuffle_ps(first, first, 0xAA); // _MM_SHUFFLE(2, 2, 2, 2)
    __m128 wwww = _mm_shuffle_ps(first, first, 0xFF); // _MM_SHUFFLE(3, 3, 3, 3)
    
  2. Let the compiler choose the best way using _mm_set1_ps and _mm_cvtss_f32:

    __m128 first = ...;
    __m128 xxxx = _mm_set1_ps(_mm_cvtss_f32(first));
    

Note that the second method produces horrible code on MSVC, as discussed here, and it only yields the 'xxxx' result, unlike the first option.

I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE

This is highly unportable. Use intrinsics.

LiraNuna
That's a very good point about portability. I hadn't really thought of it since this is mostly a learning exercise for me. Your article also looks very interesting at first glance. I'm looking forward to spending some more time with it.
jbl
The intrinsic method shown in this answer is better than inline asm because intrinsics allow the compiler to perform many optimizations that it cannot apply to inlined asm: register assignment, loop unrolling, instruction interleaving, hoisting invariants out of loops, etc. My answer used ASM because that's what the original question asked for, but if I were going to use the code myself, I would write it with intrinsics for PERFORMANCE _AND_ PORTABILITY.
Adisak
Adisak: what you said is true for anything but MSVC - it handles intrinsics very poorly (see my article). In MSVC, hand-written assembly is better if performance comes before portability and maintainability (rarely). I would just suggest switching compilers, though :).
LiraNuna
Well, at least the potential for optimization is there for Intrinsics. It's sad to hear that MSVC implements them poorly. Hopefully that will get addressed in the near future for VS2010.
Adisak
Well, it doesn't. Same results as VC2008 (for now, at least).
LiraNuna
A: 

If your values are 16 byte aligned in memory:

movdqa    (mem),    %xmm1
pshufd    $0xff,    %xmm1,    %xmm4
pshufd    $0xaa,    %xmm1,    %xmm3
pshufd    $0x55,    %xmm1,    %xmm2
pshufd    $0x00,    %xmm1,    %xmm1

If not, you can do an unaligned load, or four scalar loads. On newer platforms, the unaligned load should be faster; on older platforms the scalar loads may win.

As others have noted, you can also use shufps.

Stephen Canon
Note: `pshufd` is an SSE2 instruction.
LiraNuna
@LiraNuna: I took the questioner's use of "SSE" to mean some unspecified subset of SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, etc. Since essentially all x86 hardware has had SSE2 support for quite some number of years now, it seemed pretty safe to assume that the questioner didn't mean to proscribe it.
Stephen Canon
It's a general note - it wasn't aimed against your answer in any way.
LiraNuna