simd

C++ Adding 2 arrays together quickly

Hello! Given the arrays: int canvas[10][10]; int addon[10][10]; where all the values range from 0 to 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon? That is, I want to achieve something like: canvas += another; So if canvas[0][0] = 3 and addon[0...
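
One possible starting point for the question above is a plain nested loop, left to the compiler to vectorize. This is only a sketch: the function name `add_into` and the constants `kRows`/`kCols` are made up for illustration, and whether the loop really gets vectorized depends on the compiler and optimization flags.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names; the dimensions come from the question (10 x 10 ints).
constexpr std::size_t kRows = 10;
constexpr std::size_t kCols = 10;

// Adds every cell of addon into the matching cell of canvas.
// The loop body has no branches or cross-iteration dependencies, which is
// the shape optimizing compilers can usually turn into SIMD adds.
void add_into(int (&canvas)[kRows][kCols], const int (&addon)[kRows][kCols]) {
    for (std::size_t r = 0; r < kRows; ++r) {
        for (std::size_t c = 0; c < kCols; ++c) {
            canvas[r][c] += addon[r][c];
        }
    }
}
```

With only 100 elements the whole operation is tiny, so it is worth measuring before reaching for hand-written intrinsics.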

Fast 24-bit array -> 32-bit array conversion?

Quick Summary: I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements? Details: I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixel...
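
As a baseline for the expansion described above, a scalar loop that copies three bytes per pixel and writes a fourth fill byte might look like the following. It is a sketch under assumptions the excerpt does not confirm: the name `expand_24_to_32`, the choice to keep the byte order unchanged, and the `fill` value for the fourth byte are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Expands packed 3-byte pixels into 4-byte pixels.
// Assumptions (not from the question): byte order within a pixel is kept
// as-is, and the extra fourth byte is set to `fill`.
void expand_24_to_32(const std::uint8_t* src, std::uint8_t* dst,
                     std::size_t pixel_count, std::uint8_t fill = 0xFF) {
    assert((src != nullptr && dst != nullptr) || pixel_count == 0);
    for (std::size_t i = 0; i < pixel_count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = fill;  // padding byte (for example, alpha)
    }
}
```

The destination buffer has to hold `4 * pixel_count` bytes. A version like this is mainly useful as a reference to check a faster shuffle-based implementation against.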

How do I load all 1's into a mmx register? Why doesn't this work?

Hi, couldn't seem to find anything besides opinion questions on 64/32 bit stuff when I searched. __asm__ { mov rbx, 0xFFFFffffFFFFffffull movq mm2, rbx } After these 2 instructions the mm2 register holds the value 0x30500004ffffffff according to my xcode debugger (this is inline asm in C++). Now I am new to x86 assembly and my a...

Haskell math performance on multiply-add operation

I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode): Vec4f multiplier, addend; Vec4f vecList[]; for (int i = 0; i < count; i++) vecList[i] = vecList[i] * multiplier + addend; ...
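
For comparison with the Haskell version, the C-ish pseudocode in the question could be written out in C++ roughly as below. This is a sketch: it assumes `*` and `+` on `Vec4f` are component-wise, which the excerpt does not state, and the function name `scale_and_offset` is invented here.

```cpp
#include <cassert>
#include <cstddef>

// Four packed floats, matching the Vec4f in the question's pseudocode.
struct Vec4f {
    float x, y, z, w;
};

// Applies vecList[i] = vecList[i] * multiplier + addend to every element,
// assuming component-wise multiply and add (an assumption, see above).
void scale_and_offset(Vec4f* vecList, std::size_t count,
                      const Vec4f& multiplier, const Vec4f& addend) {
    assert(vecList != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        Vec4f& v = vecList[i];
        v.x = v.x * multiplier.x + addend.x;
        v.y = v.y * multiplier.y + addend.y;
        v.z = v.z * multiplier.z + addend.z;
        v.w = v.w * multiplier.w + addend.w;
    }
}
```

A loop in this form gives a reference timing that a Haskell implementation can be measured against.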

Ruby SIMD & SSE

I'm wondering if there is a way to extend the Ruby Array type to do SIMD & SSE vector calculations. I mean an implementation in a low-level language that Ruby programs can use for number-crunching tasks. ...

SIMD/SSE newbie: simple image filtering

I'm very new to SIMD/SSE and I'm trying to do some simple image filtering (blurring). The code below filters each pixel of an 8-bit gray bitmap with a simple [1 2 1] weighting in the horizontal direction. I'm creating sums of 16 pixels at a time. What seems very bad about this code, at least to me, is that there is a lot of insert/extract in...

How to use the multiply and accumulate intrinsics in ARM Cortex-a8?

Hi Guys, how do I use the Multiply-Accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t); Can anyone explain which three parameters I have to pass to this function? I mean the source and destination registers, and what the function returns. Help! ...
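
For the three-parameter intrinsic quoted above, the per-lane behaviour can be modelled in portable C++ as follows. This is a scalar model, not NEON code: `vmlaq_f32(a, b, c)` returns `a + b * c` in each of the four lanes, with the first argument acting as the accumulator. The type alias `Float4` and the function `mla_model` are hypothetical names used only for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for float32x4_t so the sketch builds without NEON headers.
using Float4 = std::array<float, 4>;

// Scalar model of vmlaq_f32(a, b, c): result[i] = a[i] + b[i] * c[i].
// The inputs are not modified; the accumulated value is the return value.
Float4 mla_model(const Float4& a, const Float4& b, const Float4& c) {
    Float4 result{};
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = a[i] + b[i] * c[i];
    }
    return result;
}
```

In real NEON code the usual pattern is `acc = vmlaq_f32(acc, x, y);`, reassigning the accumulator from the return value.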

How to use Eigen, the C++ template library for linear algebra?

Hi Guys, I have an image processing algorithm which makes use of matrices, and I have my own matrix operation code (multiplication, inverse...). The processor I use is an ARM Cortex-A8, which has a NEON co-processor for vectorization. As matrix operations are ideal cases for SIMD operations, I asked the compiler (-mfpu=neon -m...

Is SIMD Worth It? Is there a better option?

I have some code that runs fairly well, but I would like to make it run better. The major problem I have with it is that it needs to have a nested for loop. The outer one is for iterations (which must happen serially), and the inner one is for each point particle under consideration. I know there's not much I can do about the outer on...

Is 3x3 Matrix inverse possible using SIMD instructions?

Hi Guys, I'm making use of an ARM Cortex-A8 based processor and I have several places where I calculate 3x3 matrix inverses. As the Cortex-A8 processor has a NEON SIMD processor, I'm interested in using this co-processor for 3x3 matrix inverse. I saw several 4x4 implementations (Intel SSE and freevec) but nowhere did I see a 3x...

How to merge elements of 2 rows using NEON SIMD?

I have a matrix A = a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4 d1 d2 d3 d4. I have 2 rows with me: float32x2_t a = a1 a2 and float32x2_t b = b1 b2. From these, how can I get float32x4_t result = b1 a1 b2 a2? Is there any single NEON SIMD instruction which can merge these two rows? Or how can I achieve this in as few steps as p...

SSE2 intrinsics: access memory directly

Many SSE instructions allow the source operand to be a 16-byte aligned memory address. For example, the various (un)pack instructions. PUNPCKLBW has the following signature: PUNPCKLBW xmm1, xmm2/m128 Now this doesn't seem to be possible at all with intrinsics. It looks like it's mandatory to use _mm_load* intrinsics to read anything...

Flipping sign on packed SSE floats.

I'm looking for the most efficient method of flipping the sign on all four floats packed in an SSE register. I have not found an intrinsic for doing this in the Intel Architecture software dev manual. Below are the things I've already tried. For each case I looped over the code 10 billion times and got the wall-time indicated. I'm ...
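
One commonly used approach to the question above is to XOR each float with a mask that has only the sign bit set. The sketch below shows the idea twice: a portable scalar version, and an SSE version that is compiled only when the compiler defines `__SSE__`. The function names are hypothetical, and nothing here claims this is the fastest of the variants the asker timed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable model: flips the sign of four floats by toggling bit 31 of each.
void flip_signs(float (&v)[4]) {
    for (float& f : v) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);  // read the bit pattern safely
        bits ^= 0x80000000u;                  // toggle the IEEE-754 sign bit
        std::memcpy(&f, &bits, sizeof f);
    }
}

#if defined(__SSE__)
#include <xmmintrin.h>

// SSE version of the same idea: -0.0f has only the sign bit set, so
// XOR-ing with it negates all four lanes at once.
inline __m128 flip_signs_sse(__m128 v) {
    return _mm_xor_ps(v, _mm_set1_ps(-0.0f));
}
#endif
```

The mask is a constant, so in a loop it can be created once outside and reused.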

SSE access violation

Hi. I have the code: float *mu_x_ptr; __m128 *tmp; __m128 *mm_mu_x; mu_x_ptr = _aligned_malloc(4*sizeof(float), 16); mm_mu_x = (__m128*) mu_x_ptr; for(row = 0; row < ker_size; row++) { tmp = (__m128*) &original[row*width + col]; *mm_mu_x = _mm_add_ps(*tmp, *mm_mu_x); } From this I get: First-chance exception at 0x00ad192e i...

What is the limit of optimization using SIMD?

Hi, I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case. Do you think the use of vector operators could give bigger speedups? Thank...

Rationale for no primitive SIMD data types

(Sorry if this sounds like a rant, but it's a real question and I'd appreciate real answers) I understand that since C is so old, it might not have made sense to add it back then (MMX didn't even exist back then). But since then there was C99, and still there is no standard for SIMD variables (as far as I know). By "SIMD variables", I m...

SIMD Programming

I am using the SSE extensions available in a Core2Duo processor (compiler gcc 4.4.1). I see that there are 16 registers available, each of which is 128 bits long. Now, I can accommodate 4 integer values in a single register, and 4 in another register, and using intrinsics I can add them in one instruction. The obvious advantage is this way I re...

SSE enhanced libtiff/CCITT Fax4 encoder

Does anyone know of an SSE enhanced version of libtiff? Even just an SSE enhanced version of a CCITT Group4 encoder would do; I could do the work of sliding that one into libtiff myself. I only need to work with bitonal images. Thank you ...

How to use NEON comparison (greater than or equal to) instruction?

How do I use the NEON comparison instructions in general? Here is a case where I want to use the greater-than-or-equal-to instruction. Currently I have: int x; ... ... ... if(x >= 0) { .... } In NEON, I would like to use x in the same way, except that x is now a vector: int32x4_t x; ... ... ... if(vcgeq_s32(x, vdupq_n_s32(0))) // Wh...

SIMD (SSE) instruction for division in GCC

I'd like to optimize the following snippet using SSE instructions if possible: /* * the data structure */ typedef struct v3d v3d; struct v3d { double x; double y; double z; } tmp = { 1.0, 2.0, 3.0 }; /* * the part that should be "optimized" */ tmp.x /= 4.0; tmp.y /= 4.0; tmp.z /= 4.0; Is this possible at all? ...
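
Before turning to intrinsics for the snippet above, one simple option is to replace each division with a multiplication by the reciprocal. The sketch below assumes the divisor stays fixed at 4.0 as in the excerpt; because 4.0 is a power of two, multiplying by 0.25 gives the same result as dividing. The function name `scale_quarter` is hypothetical.

```cpp
#include <cassert>

// Same layout as the struct in the question.
struct v3d {
    double x;
    double y;
    double z;
};

// Divides all three components by 4.0 by multiplying with 0.25.
// This is exact only because the divisor is a power of two; for other
// divisors the reciprocal form can differ in the last bit.
void scale_quarter(v3d& v) {
    v.x *= 0.25;
    v.y *= 0.25;
    v.z *= 0.25;
}
```

With three doubles per struct there is little room for packed SSE2 operations, which handle two doubles per register, so any gain from hand-written SIMD would need to be measured.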