This is difficult. There is no single instruction that can do this, and the best solution depends on whether your data is in memory or already in registers.
You need at least two operations to do the conversion. First, a vector transpose (vtrn), which permutes your arguments like this:
a = a1 a2
b = b1 b2
vtrn.32 a, b
a = a1 b1
b = a2 b2
Then you have to swap the two elements within each vector, either by reversing each vector on its own or by treating the two vectors as one quad vector and doing a single reverse on it.
temp = {a, b}
temp = a1 b1 a2 b2
vrev64.32 temp, temp
temp = b1 a1 b2 a2 <-- this is what you want.
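To make the data movement above easier to follow, here is a rough scalar model of the two steps in plain C. It is only a sketch: the function names (model_vtrn32, model_vrev64_32, interleave_swapped) are made up for illustration, and it models the lane shuffling only, not real NEON registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vtrn.32 on two 2-lane (D-register-sized) vectors:
   swaps a[1] with b[0], so a = a1 b1 and b = a2 b2 afterwards. */
static void model_vtrn32(uint32_t *a, uint32_t *b)
{
    uint32_t t = a[1];
    a[1] = b[0];
    b[0] = t;
}

/* Scalar model of vrev64.32 on a 4-lane (Q-register-sized) vector:
   swaps the two 32-bit lanes inside each 64-bit half. */
static void model_vrev64_32(uint32_t *q)
{
    uint32_t t;
    t = q[0]; q[0] = q[1]; q[1] = t;
    t = q[2]; q[2] = q[3]; q[3] = t;
}

/* a = a1 a2, b = b1 b2  ->  out = b1 a1 b2 a2 */
void interleave_swapped(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    uint32_t d0[2], d1[2];

    assert(a != NULL && b != NULL && out != NULL);

    d0[0] = a[0]; d0[1] = a[1];
    d1[0] = b[0]; d1[1] = b[1];

    model_vtrn32(d0, d1);            /* d0 = a1 b1, d1 = a2 b2 */

    out[0] = d0[0]; out[1] = d0[1];  /* temp = {d0, d1} = a1 b1 a2 b2 */
    out[2] = d1[0]; out[3] = d1[1];

    model_vrev64_32(out);            /* out = b1 a1 b2 a2 */
}
```

Calling interleave_swapped with a = {1, 2} and b = {3, 4} fills out with {3, 1, 4, 2}.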
If you load your data from memory you can skip the first vtrn.32 instruction because NEON can do this while it loads the data using the vld2.32 instruction. Here is a little assembler function that does just that:
.globl asmtest
asmtest:
vld2.32 {d0-d1}, [r0] @ load two vectors and transpose
vrev64.32 q0, q0 @ reverse within d0 and d1
vst1.32 {d0-d1}, [r0] @ store result
mov pc, lr @ return from subroutine
Btw, a little note: for 64-bit D registers holding two 32-bit elements, vtrn.32, vzip.32 and vuzp.32 produce identical results. This no longer holds for Q registers or for smaller element sizes.
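If you want to convince yourself of that note, a scalar model helps. The sketch below models the three permutations for n lanes per vector; the function names and the MAX_LANES limit are my own, purely for illustration. With two lanes per vector (the D register case with 32-bit elements) all three give the same result, with four lanes they do not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LANES 4

/* vtrn: for each lane pair (i, i+1), swap a[i+1] with b[i]. */
void model_vtrn(uint32_t *a, uint32_t *b, int n)
{
    for (int i = 0; i + 1 < n; i += 2) {
        uint32_t t = a[i + 1];
        a[i + 1] = b[i];
        b[i] = t;
    }
}

/* vzip: interleave a and b; first half of the result goes to a,
   second half to b. */
void model_vzip(uint32_t *a, uint32_t *b, int n)
{
    uint32_t z[2 * MAX_LANES];
    assert(n <= MAX_LANES);
    for (int i = 0; i < n; i++) {
        z[2 * i]     = a[i];
        z[2 * i + 1] = b[i];
    }
    memcpy(a, z, (size_t)n * sizeof *a);
    memcpy(b, z + n, (size_t)n * sizeof *b);
}

/* vuzp: view {a, b} as one sequence; even-indexed elements go to a,
   odd-indexed elements go to b. */
void model_vuzp(uint32_t *a, uint32_t *b, int n)
{
    uint32_t cat[2 * MAX_LANES];
    assert(n <= MAX_LANES);
    memcpy(cat, a, (size_t)n * sizeof *a);
    memcpy(cat + n, b, (size_t)n * sizeof *b);
    for (int i = 0; i < n; i++) {
        a[i] = cat[2 * i];
        b[i] = cat[2 * i + 1];
    }
}

/* Returns 1 if all three permutations agree for n lanes per vector. */
int all_three_agree(int n)
{
    uint32_t a[3][MAX_LANES], b[3][MAX_LANES];
    assert(n >= 2 && n <= MAX_LANES && n % 2 == 0);

    for (int k = 0; k < 3; k++) {
        for (int i = 0; i < n; i++) {
            a[k][i] = (uint32_t)i;
            b[k][i] = (uint32_t)(n + i);
        }
    }

    model_vtrn(a[0], b[0], n);
    model_vzip(a[1], b[1], n);
    model_vuzp(a[2], b[2], n);

    size_t bytes = (size_t)n * sizeof(uint32_t);
    return memcmp(a[0], a[1], bytes) == 0 && memcmp(b[0], b[1], bytes) == 0
        && memcmp(a[0], a[2], bytes) == 0 && memcmp(b[0], b[2], bytes) == 0;
}
```

all_three_agree(2) returns 1 and all_three_agree(4) returns 0.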
And with NEON intrinsics? Well - simply said you're screwed. As you've already found out you can't directly cast from one type to another and you can't directly mix quad and double vectors.
This is the best I came up with using intrinsics (it does not use the vld2.32 trick for readability):
#include <arm_neon.h>
#include <stdio.h>

int main (void)
{
  const float32_t data[4] =
  {
    1, 2, 3, 4
  };
  float32_t output[4];
  /* load test vectors: a = a1 a2, b = b1 b2 */
  float32x2_t a = vld1_f32 (data + 0);
  float32x2_t b = vld1_f32 (data + 2);
  /* interleave b and a (b1 a1 | b2 a2) and combine into a float32x4_t */
  float32x2x2_t temp = vzip_f32 (b, a);
  float32x4_t result = vcombine_f32 (temp.val[0], temp.val[1]);
  /* store for printing */
  vst1q_f32 (output, result);
  /* print out the original and the permuted result */
  printf ("%f %f %f %f\n", data[0], data[1], data[2], data[3]);
  printf ("%f %f %f %f\n", output[0], output[1], output[2], output[3]);
  return 0;
}
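For completeness, here is a sketch of how the vld2.32 trick might look with intrinsics. The function name interleave_swapped_f32 is made up, and the plain C branch is my own addition so that the snippet still compiles on a machine without NEON. I have not looked at what code a compiler generates for this, so treat it as a starting point only.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* in  = a1 a2 b1 b2 (four consecutive floats)
   out = b1 a1 b2 a2 */
void interleave_swapped_f32(const float *in, float *out)
{
    assert(in != NULL && out != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* vld2 de-interleaves while loading: val[0] = a1 b1, val[1] = a2 b2 */
    float32x2x2_t t = vld2_f32(in);
    /* join both halves into one quad vector: a1 b1 a2 b2 */
    float32x4_t q = vcombine_f32(t.val[0], t.val[1]);
    /* swap the lanes within each 64-bit half: b1 a1 b2 a2 */
    vst1q_f32(out, vrev64q_f32(q));
#else
    /* Portable fallback (an assumption of this sketch, not NEON code). */
    const float a1 = in[0], a2 = in[1], b1 = in[2], b2 = in[3];
    out[0] = b1;
    out[1] = a1;
    out[2] = b2;
    out[3] = a2;
#endif
}
```

Like the assembler function, this also works in place, because all four values are read before anything is written back.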
If you're using GCC this will work, but the code generated by GCC will be horrible and slow. NEON intrinsic support is still very young. You'll probably get better performance with straightforward C code here.
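A straightforward C version could look something like this. It is a minimal sketch with a made-up function name (interleave_swapped_c), and I make no claim about how it compares in speed on any particular compiler.

```c
#include <assert.h>
#include <stddef.h>

/* a = a1 a2, b = b1 b2  ->  out = b1 a1 b2 a2 */
void interleave_swapped_c(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);

    /* Read everything first so that out may overlap a or b. */
    const float a1 = a[0], a2 = a[1];
    const float b1 = b[0], b2 = b[1];

    out[0] = b1;
    out[1] = a1;
    out[2] = b2;
    out[3] = a2;
}
```

With the test data from above, interleave_swapped_c (data, data + 2, output) leaves 3 1 4 2 in output.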