I have a matrix:

A = a1 a2 a3 a4
    b1 b2 b3 b4
    c1 c2 c3 c4
    d1 d2 d3 d4

I have the first two elements of rows a and b:

float32x2_t a = a1 a2
float32x2_t b = b1 b2

From these, how can I get:

float32x4_t result = b1 a1 b2 a2

Is there a single NEON SIMD instruction that can merge these two rows? If not, how can I achieve this in as few steps as possible using intrinsics?

I thought of using the zip/unzip intrinsics, but the type that the zip function returns, float32x2x2_t, is not suitable for me; I need a float32x4_t.

float32x2x2_t vzip_f32 (float32x2_t, float32x2_t)
A:

This is difficult. There is no single instruction that can do this, and the best solution depends on whether your data is in memory or already in registers.

You need at least two operations to do the conversion. First, a vector transpose (vtrn), which permutes your arguments like this:

a = a1 a2
b = b1 b2

vtrn.32  a, b

a = a1 b1 
b = a2 b2

Then you have to swap the two elements within each vector, either by reversing each vector on its own or by treating the two vectors as one quad vector and doing a single reverse on it.

temp = {a, b} 
temp = a1 b1 a2 b2

vrev64.32 temp, temp

temp = b1 a1 b2 a2    <-- this is what you want.
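
The two steps above can be sketched with intrinsics roughly as follows. This is only an illustration, not tuned code: the helper name interleave_ba is hypothetical, and the scalar branch is there purely so the sketch also builds on machines without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: writes { b[0], a[0], b[1], a[1] } to out. */
void interleave_ba(const float a[2], const float b[2], float out[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2_t   va = vld1_f32(a);
    float32x2_t   vb = vld1_f32(b);
    float32x2x2_t t  = vtrn_f32(va, vb);                     /* {a1 b1}, {a2 b2} */
    float32x4_t   q  = vcombine_f32(t.val[0], t.val[1]);     /* a1 b1 a2 b2 */
    vst1q_f32(out, vrev64q_f32(q));                          /* b1 a1 b2 a2 */
#else
    /* Scalar stand-in for the same two steps (assumption: no NEON available). */
    float t[4] = { a[0], b[0], a[1], b[1] };                 /* after the transpose */
    out[0] = t[1]; out[1] = t[0];                            /* reverse low pair    */
    out[2] = t[3]; out[3] = t[2];                            /* reverse high pair   */
#endif
}
```

Calling it with a = {1, 2} and b = {3, 4} fills out with 3 1 4 2.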

If you load your data from memory, you can skip the vtrn.32 instruction, because NEON can do the transpose while it loads the data using the vld2.32 instruction. Here is a small assembler function that does just that:

.globl asmtest

asmtest:
        vld2.32     {d0-d1}, [r0]   @ load two vectors and transpose: d0 = a1 b1, d1 = a2 b2
        vrev64.32   q0, q0          @ swap the two 32-bit elements within d0 and within d1
        vst1.32     {d0-d1}, [r0]   @ store result
        mov pc, lr                  @ return from subroutine
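
For comparison, the same load-and-reverse idea might look like this with intrinsics. Treat it as a sketch under assumptions: swizzle_in_place is a made-up name, the buffer is assumed to hold a1 a2 b1 b2, and the scalar branch only mimics the behaviour on targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: p holds a1 a2 b1 b2 on entry and b1 a1 b2 a2 on return. */
void swizzle_in_place(float p[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x2x2_t t = vld2_f32(p);                        /* {p0 p2}, {p1 p3} = {a1 b1}, {a2 b2} */
    float32x4_t   q = vcombine_f32(t.val[0], t.val[1]);   /* a1 b1 a2 b2 */
    vst1q_f32(p, vrev64q_f32(q));                         /* b1 a1 b2 a2 */
#else
    /* Scalar stand-in (assumption: no NEON available). */
    float t[4] = { p[0], p[2], p[1], p[3] };              /* de-interleaving load */
    p[0] = t[1]; p[1] = t[0];                             /* reverse low pair     */
    p[2] = t[3]; p[3] = t[2];                             /* reverse high pair    */
#endif
}
```

Whether the compiler turns this into the three-instruction sequence shown in the assembler version depends on the compiler and its version.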

Btw, a little note: the instructions vtrn.32, vzip.32 and vuzp.32 are identical, but only if you're working with 32-bit elements in 64-bit (D) registers. On quad registers they differ.

And with NEON intrinsics? Well, simply put, you're screwed. As you've already found out, you can't directly cast from one type to another, and you can't directly mix quad and double vectors.

This is the best I came up with using intrinsics (for readability, it does not use the vld2.32 trick):

#include <stdio.h>
#include <arm_neon.h>

int main (int argc, char **args)
{
  const float32_t data[4] =
  {
    1, 2, 3, 4
  };

  float32_t     output[4];

  /* load test vectors: a = {1, 2}, b = {3, 4} */
  float32x2_t   a = vld1_f32 (data + 0);
  float32x2_t   b = vld1_f32 (data + 2);

  /* interleave b and a, then join the two halves into a float32x4_t */
  float32x2x2_t temp   = vzip_f32 (b,a);                           /* {b1 a1}, {b2 a2} */
  float32x4_t   result = vcombine_f32 (temp.val[0], temp.val[1]);  /* b1 a1 b2 a2 */

  /* store for printing */
  vst1q_f32 (output, result);

  /* print out the original and the interleaved result */
  printf ("%f %f %f %f\n", data[0],   data[1],   data[2],   data[3]);
  printf ("%f %f %f %f\n", output[0], output[1], output[2], output[3]);

  return 0;
}

If you're using GCC this will work, but the code generated by GCC will be horrible and slow. NEON intrinsic support is still very young. You'll probably get better performance with straightforward C code here.
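
A plain C version of the same shuffle could be as small as the sketch below. The function name interleave_ba_scalar is hypothetical, and whether it actually beats the intrinsics depends on your compiler, so measure before relying on it.

```c
/* Hypothetical helper: writes { b[0], a[0], b[1], a[1] } to out using plain C. */
void interleave_ba_scalar(const float a[2], const float b[2], float out[4])
{
    out[0] = b[0];
    out[1] = a[0];
    out[2] = b[1];
    out[3] = a[1];
}
```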

Nils Pipenbrinck
Hello Nils, I'm having trouble compiling for NEON. I get some strange errors when compiling my code with the CodeSourcery compiler, and I can't work out what they mean. Could you kindly take a look at my question and suggest what to do? http://stackoverflow.com/questions/3811148/unknown-gcc-error-while-compiling-for-arm-neon-critical
vikramtheone