views:

100

answers:

2

Hi Guys,

how to use the Multiply-Accumulate intrinsics provided by GCC?

float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);

Can anyone explain what three parameters I have to pass to this function. I mean the Source and destination registers and what the function returns?

Help!!!

+4  A: 

Google'd for vmlaq_f32, turned up the reference for the RVCT compiler tools. Here's what it says:

Vector multiply accumulate: vmla -> Vr[i] := Va[i] + Vb[i] * Vc[i]
...
float32x4_t vmlaq_f32 (float32x4_t a, float32x4_t b, float32x4_t c);

AND

The following types are defined to represent vectors. NEON vector data types are named according to the following pattern: <type><size>x<number of lanes>_t For example, int16x4_t is a vector containing four lanes each containing a signed 16-bit integer. Table E.1 lists the vector data types.

IOW, the return value from the function will be a vector containing 4 32-bit floats, and each element of the vector is calculated by multiplying the corresponding elements of b and c, and adding the contents of a.

HTH

Aidan Cully
+4  A: 

Simply said the vmla instruction does the following:

struct 
{
  float val[4];
} float32x4_t


float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
  float32x4 result;

  for (int i=0; i<4; i++)
  {
    result.val[i] =  b.val[i]*c.val[i]+a.val[i];
  }

  return result;
}

And all this compiles into a singe assembler instruction :-)

You can use this NEON-assembler intrinsic among other things in typical 4x4 matrix multiplications for 3D-graphics like this:

float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
{
  /* in a perfect world this code would compile into just four instructions */
  float32x4_t result;

  result = vml (matrix[0], vector);
  result = vmla (result, matrix[1], vector);
  result = vmla (result, matrix[2], vector);
  result = vmla (result, matrix[3], vector);

  return result;
}

This saves a couple of cycles because you don't have to add the results after multiplication. The addition is so often used that multiply-accumulates hsa become mainstream these days (even x86 has added them in some recent SSE instruction set).

Also worth mentioning: Multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast-path inside the Cortex-A8 NEON-Core. This fast-path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VML or VMLA instruction. I could go into detail but in a nutshell such an instruction series runs four times faster than a VML / VADD / VML / VADD series.

Take a look at my simple matrix-multiply: I did exactly that. Due to this fast-path it will run roughly four times faster than implementation written using VML and ADD instead of VMLA.

Nils Pipenbrinck
Thank you for such a detailed reply. Your reply not just explains the instruction's functionality but also the pros and cons for using this instruction.
vikramtheone
Hi Nils, I understood how the matrix multiplication can be sped up using the NEON instructions. Its really addictive now :) I want to use the NEON instructions to do inverse of a matrix, can you point me to some good documents which explain how to use NEON instructions to do the inverse a matrix or can you give me any ideas, how to go about that? Thank you.
vikramtheone
for matrix inverse I'd do a google search on "sse matrix inverse" and port the sse code to NEON. The usual way is to calculate the inverse for small matrices (4x4) is via Cramers rule.
Nils Pipenbrinck
Nils can you please take a look at this related question of mine? Also can you please compile my example code I have posted there and tell me if the compiler is able to generate NEON SIMD instructions for the matrix multiplication? Thank you.[http://stackoverflow.com/questions/3307821/how-to-verify-vectorization-for-eigen-the-c-template-library-for-linear-algebr]
vikramtheone