Hi, my image processing project works with grayscale images. My platform is an ARM Cortex-A8 processor, and I want to make use of NEON.

I have a grayscale image (consider the example below), and in my algorithm I only have to add up the columns.

How can I load four 8-bit pixel values in parallel, which are uint8_t, as four uint32_t into one of the 128-bit NEON registers? What intrinsic do I have to use to do this?

I must load them as 32 bits because, if you look carefully, as soon as I add 255 + 255 I get 510, which can't be held in an 8-bit register.

e.g.

255 255 255 255 ......... (640 pixels)
255 255 255 255
255 255 255 255
255 255 255 255
.
.
.
.
.
(480 pixels) 
+1  A: 

Depends on your compiler and (possible lack of) extensions.

E.g., for GCC, this might be a starting point: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
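As a rough sketch of what the intrinsics route could look like, assuming a compiler that ships `arm_neon.h` (GCC does): load 8 pixels with `vld1_u8`, widen them to 16 bits with `vmovl_u8`, then widen each half to 32 bits with `vmovl_u16`. The helper name `widen8_u8_to_u32` is made up for illustration, and the scalar branch is only there so the snippet also builds on a non-NEON host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: widen 8 consecutive 8-bit pixels to 8 uint32_t values. */
void widen8_u8_to_u32(const uint8_t *src, uint32_t dst[8])
{
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t  p8  = vld1_u8(src);                   /* 8 x u8 in one D register  */
    uint16x8_t p16 = vmovl_u8(p8);                   /* 8 x u16 in one Q register */
    uint32x4_t lo  = vmovl_u16(vget_low_u16(p16));   /* pixels 0..3 as u32        */
    uint32x4_t hi  = vmovl_u16(vget_high_u16(p16));  /* pixels 4..7 as u32        */
    vst1q_u32(dst, lo);
    vst1q_u32(dst + 4, hi);
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    for (int i = 0; i < 8; ++i)
        dst[i] = src[i];
#endif
}
```

Loading 8 pixels at a time and splitting them into two Q registers tends to be simpler than trying to load exactly four bytes.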

domen
+4  A: 

I recommend that you spend a bit of time understanding how SIMD works on ARM. Take a look at:

  1. http://blogs.arm.com/software-enablement/coding-for-neon-part-1-load-and-stores/
  2. http://blogs.arm.com/software-enablement/coding-for-neon-part-2-dealing-with-leftovers/
  3. http://blogs.arm.com/software-enablement/coding-for-neon-part-3-matrix-multiplication/
  4. http://blogs.arm.com/software-enablement/coding-for-neon-part-4-shifting-left-and-right/

to get you started. You can then implement your SIMD code using inline assembler or the corresponding ARM intrinsics that domen recommended.

doron
A: 

If you need to sum up to 480 8-bit values then you would technically need 17 bits of intermediate storage (480 x 255 = 122400, which is more than 65535). However, if you perform the additions in two stages, i.e., the top 240 rows and then the bottom 240 rows, you can do each stage in 16 bits (240 x 255 = 61200). Then you can add the results from the two halves to get the final answer.
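To make the bit-width argument concrete, here is a minimal scalar sketch of the two-stage sum for a single column. It assumes a row-major image with `width` bytes per row; the function name `sum_column_two_stage` and its parameters are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum one column of a row-major 8-bit image in two
 * stages, keeping each partial sum in 16 bits. */
uint32_t sum_column_two_stage(const uint8_t *img, int width, int height, int col)
{
    assert(img != NULL && col >= 0 && col < width);
    /* Each half may cover at most 257 rows, since 257 * 255 = 65535. */
    assert(height >= 0 && height <= 514);

    int half = height / 2;
    uint16_t top = 0, bottom = 0;

    for (int y = 0; y < half; ++y)            /* top half of the column    */
        top = (uint16_t)(top + img[(size_t)y * (size_t)width + (size_t)col]);
    for (int y = half; y < height; ++y)       /* bottom half of the column */
        bottom = (uint16_t)(bottom + img[(size_t)y * (size_t)width + (size_t)col]);

    /* Only the final addition needs more than 16 bits. */
    return (uint32_t)top + (uint32_t)bottom;
}
```

For a 480-row column of all-255 pixels, each half reaches 61200 and the combined result is 122400.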

There is actually a NEON instruction that is suitable for your algorithm, called vaddw. It adds a doubleword (64-bit, D register) vector to a quadword (128-bit, Q register) vector, with the latter containing elements that are twice as wide as the former. In your case, vaddw.u8 can be used to add 8 pixels to 8 16-bit accumulators. Then, vaddw.u16 can be used to add the two sets of 8 16-bit accumulators into one set of 8 32-bit ones. Note that each vaddw.u16 only handles four lanes, so you must use the instruction twice to get both halves.
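Using intrinsics rather than assembler, the approach might look roughly like the sketch below. `vaddw_u8` and `vaddw_u16` are the intrinsic forms of vaddw.u8 and vaddw.u16; the helper name `sum8_columns`, the `stride` parameter, and the scalar fallback for non-NEON hosts are assumptions made for the example. It sums 8 adjacent columns at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum 8 adjacent columns of a row-major 8-bit image.
 * img points at the first of the 8 columns, stride is the bytes per row. */
void sum8_columns(const uint8_t *img, int stride, int height, uint32_t out[8])
{
    assert(img != NULL && out != NULL && stride >= 8);
    assert(height >= 0 && height <= 514);  /* each half must fit in 16 bits */
    int half = height / 2;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t top = vdupq_n_u16(0);       /* 8 x u16 accumulators, top half    */
    uint16x8_t bottom = vdupq_n_u16(0);    /* 8 x u16 accumulators, bottom half */

    /* vaddw.u8: widen 8 pixels to 16 bits and add them to the accumulators. */
    for (int y = 0; y < half; ++y)
        top = vaddw_u8(top, vld1_u8(img + (size_t)y * (size_t)stride));
    for (int y = half; y < height; ++y)
        bottom = vaddw_u8(bottom, vld1_u8(img + (size_t)y * (size_t)stride));

    /* Widen the top sums to 32 bits, then vaddw.u16 once per group of 4 lanes. */
    uint32x4_t lo = vaddw_u16(vmovl_u16(vget_low_u16(top)), vget_low_u16(bottom));
    uint32x4_t hi = vaddw_u16(vmovl_u16(vget_high_u16(top)), vget_high_u16(bottom));
    vst1q_u32(out, lo);
    vst1q_u32(out + 4, hi);
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    (void)half;
    for (int c = 0; c < 8; ++c) {
        uint32_t sum = 0;
        for (int y = 0; y < height; ++y)
            sum += img[(size_t)y * (size_t)stride + (size_t)c];
        out[c] = sum;
    }
#endif
}
```

For a 640-pixel-wide image you would call this 80 times, advancing `img` by 8 bytes each time and passing `stride = 640`.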

If necessary, you can also convert the values back to 16-bit or 8-bit by using vmovn (which truncates) or vqmovn (which saturates).
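A small sketch of the difference between the two narrowing instructions, using the `vmovn_u32` and `vqmovn_u32` intrinsics on four 32-bit sums. The helper name `narrow4_u32_to_u16` is hypothetical, and the scalar branch is only there so the snippet builds on a non-NEON host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: narrow four 32-bit values to 16 bits in both ways. */
void narrow4_u32_to_u16(const uint32_t src[4],
                        uint16_t truncated[4], uint16_t saturated[4])
{
    assert(src != NULL && truncated != NULL && saturated != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t v = vld1q_u32(src);
    vst1_u16(truncated, vmovn_u32(v));   /* vmovn: keep the low 16 bits    */
    vst1_u16(saturated, vqmovn_u32(v));  /* vqmovn: clamp to 65535 instead */
#else
    /* Scalar fallback with the same result, for hosts without NEON. */
    for (int i = 0; i < 4; ++i) {
        truncated[i] = (uint16_t)(src[i] & 0xFFFFu);
        saturated[i] = (uint16_t)(src[i] > 0xFFFFu ? 0xFFFFu : src[i]);
    }
#endif
}
```

A full-height column sum of 122400 does not fit in 16 bits, so vmovn would silently wrap it while vqmovn would clamp it to 65535.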

Exophase