Hello!
Given the arrays:
int canvas[10][10];
int addon[10][10];
Where all the values range from 0 - 100, what is the fastest way in C++ to add those two arrays so each cell in canvas equals itself plus the corresponding cell value in addon?
I.e., I want to achieve something like:
canvas += addon;
So if canvas[0][0] =3 and addon[0...
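A minimal sketch of the element-wise addition asked about above, assuming plain nested loops are acceptable. The function name `add_in_place` is hypothetical. Because a 2D `int` array is contiguous, optimizing compilers can typically auto-vectorize a loop like this without hand-written SIMD.

```cpp
#include <cassert>
#include <cstddef>

// Adds every cell of addon to the matching cell of canvas, in place.
// add_in_place is a hypothetical name, not taken from the question.
void add_in_place(int (&canvas)[10][10], const int (&addon)[10][10]) {
    for (std::size_t r = 0; r < 10; ++r) {
        for (std::size_t c = 0; c < 10; ++c) {
            // Range stated in the question; compiled out when NDEBUG is set.
            assert(addon[r][c] >= 0 && addon[r][c] <= 100);
            canvas[r][c] += addon[r][c];
        }
    }
}
```

With values limited to 0 - 100 the sums stay far below the `int` limit, so no overflow handling is needed here.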
Quick Summary:
I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixel...
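As a baseline for the 24-bit to 32-bit expansion described above, here is a scalar sketch. The name `expand_24_to_32` and the `fill` parameter (the value written to the added fourth byte) are assumptions, not part of the question. A SIMD shuffle could replace the inner copy, but this version states the intended layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Expands tightly packed 3-byte pixels into 4-byte pixels.
// The added fourth byte is set to `fill` (0xFF is an assumed default).
void expand_24_to_32(const std::uint8_t* src, std::uint8_t* dst,
                     std::size_t pixel_count, std::uint8_t fill = 0xFF) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < pixel_count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = fill;
    }
}
```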
Hi, couldn't seem to find anything besides opinion questions on 64/32 bit stuff when I searched.
__asm__ {
mov rbx, 0xFFFFffffFFFFffffull
movq mm2, rbx
}
After these two instructions the mm2 register holds the value 0x30500004ffffffff according to my Xcode debugger (this is inline asm in C++). Now I am new to x86 assembly and my a...
I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying performance of one particular operation (C-ish pseudocode):
Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;
...
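The pseudocode above can be sketched in C++ as follows, assuming `Vec4f` is four packed floats and that the multiply and add are per component. The struct layout and the function name `scale_and_offset` are hypothetical. A loop in this shape is one that compilers can usually map onto 4-wide SIMD multiplies and adds.

```cpp
#include <cassert>
#include <cstddef>

// Assumed layout: four packed floats per vector.
struct Vec4f { float v[4]; };

// Applies vecList[i] = vecList[i] * multiplier + addend, per component.
void scale_and_offset(Vec4f* vecList, std::size_t count,
                      const Vec4f& multiplier, const Vec4f& addend) {
    assert(vecList != nullptr || count == 0);
    // Local copies rule out aliasing between the constants and vecList.
    const Vec4f m = multiplier;
    const Vec4f a = addend;
    for (std::size_t i = 0; i < count; ++i) {
        for (int lane = 0; lane < 4; ++lane) {
            vecList[i].v[lane] = vecList[i].v[lane] * m.v[lane] + a.v[lane];
        }
    }
}
```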
I'm wondering if there is a way to extend Ruby's Array type to do SIMD/SSE vector calculations.
I mean implementing it in a low-level language so that it can be used from Ruby programs for number-crunching tasks.
...
I'm very new to SIMD/SSE and I'm trying to do some simple image filtering (blurring).
The code below filters each pixel of an 8-bit grayscale bitmap with a simple [1 2 1] weighting in the horizontal direction. I'm creating sums of 16 pixels at a time.
What seems very bad about this code, at least to me, is that there is a lot of insert/extract in...
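Since the SSE code itself is cut off, here is only a scalar reference for the [1 2 1] horizontal filter, useful for checking a vectorized version against. The normalization (divide by 4 with rounding) and the border policy (repeat the edge pixel) are assumptions, and `blur_row_121` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Filters one row: out[x] = (in[x-1] + 2*in[x] + in[x+1] + 2) / 4.
// Edge pixels reuse the nearest valid neighbor (assumed border policy).
void blur_row_121(const std::uint8_t* in, std::uint8_t* out, std::size_t width) {
    assert(in != out);  // this sketch is not an in-place filter
    for (std::size_t x = 0; x < width; ++x) {
        const unsigned left  = in[x == 0 ? 0 : x - 1];
        const unsigned mid   = in[x];
        const unsigned right = in[x + 1 == width ? x : x + 1];
        // The weighted sum is at most 4 * 255 + 2, so it fits in unsigned.
        out[x] = static_cast<std::uint8_t>((left + 2 * mid + right + 2) / 4);
    }
}
```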
Hi Guys,
How do I use the multiply-accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone explain which three parameters I have to pass to this function? I mean, which are the source and destination registers, and what does the function return?
Help!!!
...
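A small sketch of what `vmlaq_f32` computes: `vmlaq_f32(a, b, c)` returns `a + b * c` per lane, with `a` acting as the accumulator and the result handed back as the return value. The wrapper name `multiply_accumulate4` is hypothetical, and the scalar branch is only there so the sketch also builds on hosts without NEON.

```cpp
#include <cassert>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Computes acc[i] + x[i] * y[i] for four floats and writes it to out.
void multiply_accumulate4(const float* acc, const float* x,
                          const float* y, float* out) {
    assert(acc && x && y && out);
#if defined(__ARM_NEON)
    const float32x4_t a = vld1q_f32(acc);  // accumulator input
    const float32x4_t b = vld1q_f32(x);    // first multiplicand
    const float32x4_t c = vld1q_f32(y);    // second multiplicand
    // The result is returned; `a` itself is not modified.
    vst1q_f32(out, vmlaq_f32(a, b, c));
#else
    // Scalar fallback with the same per-lane meaning.
    for (int i = 0; i < 4; ++i) out[i] = acc[i] + x[i] * y[i];
#endif
}
```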
Hi Guys,
I have an image processing algorithm which makes use of matrices, and I have my own matrix operation code (multiplication, inverse...). The processor I use is an ARM Cortex-A8, which has a NEON co-processor for vectorization. As matrix operations are ideal cases for SIMD operations, I asked the compiler (-mfpu=neon -m...
I have some code that runs fairly well, but I would like to make it run better. The major problem I have with it is that it needs to have a nested for loop. The outer one is for iterations (which must happen serially), and the inner one is for each point particle under consideration. I know there's not much I can do about the outer on...
Hi Guys,
I'm using an ARM Cortex-A8 based processor, and I have several places where I calculate 3x3 matrix inverses.
As the Cortex-A8 has a NEON SIMD co-processor, I'm interested in using it for the 3x3 matrix inverse. I saw several 4x4 implementations (Intel SSE and freevec) but nowhere did I see a 3x...
I have a matrix:
A = a1 a2 a3 a4
    b1 b2 b3 b4
    c1 c2 c3 c4
    d1 d2 d3 d4
I have 2 rows with me,
float32x2_t a = a1 a2
float32x2_t b = b1 b2
From these, how can I get:
float32x4_t result = b1 a1 b2 a2
Is there any single NEON SIMD instruction which can merge these two rows?
Or how can I achieve this in as few steps as p...
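One way to get the `b1 a1 b2 a2` ordering is a zip followed by a combine, sketched below. `vzip_f32(b, a)` yields the pairs `{b1, a1}` and `{b2, a2}`, and `vcombine_f32` joins them into one `float32x4_t`. Whether the compiler lowers this to a single instruction is not something this sketch claims. The wrapper name `interleave_b_a` is hypothetical, and the scalar branch only exists for non-NEON hosts.

```cpp
#include <cassert>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Given a = {a1, a2} and b = {b1, b2}, writes {b1, a1, b2, a2} to out.
void interleave_b_a(const float* a, const float* b, float* out) {
    assert(a && b && out);
#if defined(__ARM_NEON)
    const float32x2_t va = vld1_f32(a);
    const float32x2_t vb = vld1_f32(b);
    // z.val[0] = {b1, a1}, z.val[1] = {b2, a2}
    const float32x2x2_t z = vzip_f32(vb, va);
    vst1q_f32(out, vcombine_f32(z.val[0], z.val[1]));
#else
    // Scalar fallback producing the same ordering.
    out[0] = b[0];
    out[1] = a[0];
    out[2] = b[1];
    out[3] = a[1];
#endif
}
```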
Many SSE instructions allow the source operand to be a 16-byte aligned memory address. For example, the various (un)pack instructions. PUNPCKLBW has the following signature:
PUNPCKLBW xmm1, xmm2/m128
Now this doesn't seem to be possible at all with intrinsics. It looks like it's mandatory to use _mm_load* intrinsics to read anything...
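A hedged sketch of one common approach: dereference a 16-byte aligned `__m128i` pointer directly as the second argument. This leaves the compiler free to fold the load into the instruction's memory operand, although that folding is an optimization and is not guaranteed. The function name `unpack_lo_with_memory` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Interleaves the low 8 bytes of `a` with the low 8 bytes stored at `p`.
// `p` must be 16-byte aligned, matching the m128 operand requirement.
__m128i unpack_lo_with_memory(__m128i a, const __m128i* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    // No explicit _mm_load_si128: the dereference is an aligned load that
    // the compiler may merge into PUNPCKLBW xmm, m128.
    return _mm_unpacklo_epi8(a, *p);
}
```

Inspecting the generated assembly is the only way to confirm whether the memory operand form was actually used.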
I'm looking for the most efficient method of flipping the sign on all four floats packed in an SSE register.
I have not found an intrinsic for doing this in the Intel Architecture software dev manual. Below are the things I've already tried.
For each case I looped over the code 10 billion times and got the wall-time indicated. I'm ...
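For reference, one widely used way to flip the sign of all four floats is to XOR with a mask that has only the sign bit set in each lane, which is the bit pattern of `-0.0f`. The variants tried in the question are cut off, so this may repeat one of them. The wrapper name `negate4` is hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE

// Writes -in[i] to out[i] for four floats by toggling each sign bit.
void negate4(const float* in, float* out) {
    assert(in != nullptr && out != nullptr);
    // -0.0f has only the sign bit set, so XOR flips the sign and nothing else.
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    _mm_storeu_ps(out, _mm_xor_ps(_mm_loadu_ps(in), sign_mask));
}
```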
Hi.
I have the code:
float *mu_x_ptr;
__m128 *tmp;
__m128 *mm_mu_x;
mu_x_ptr = _aligned_malloc(4*sizeof(float), 16);
mm_mu_x = (__m128*) mu_x_ptr;
for(row = 0; row < ker_size; row++) {
    tmp = (__m128*) &original[row*width + col];
    *mm_mu_x = _mm_add_ps(*tmp, *mm_mu_x);
}
From this I get:
First-chance exception at 0x00ad192e i...
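The exception text is cut off, so the following assumes it is an alignment fault: `&original[row*width + col]` is generally not 16-byte aligned, and dereferencing it as `__m128*` requires that alignment. A sketch using `_mm_loadu_ps`, which accepts unaligned addresses, is shown below. It also starts from an explicit zero, since memory from `_aligned_malloc` is not zero-initialized. The function name `sum_column_block` is hypothetical; the parameter names mirror the question.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE

// Sums ker_size rows of four adjacent floats starting at column col.
__m128 sum_column_block(const float* original, std::size_t width,
                        std::size_t col, std::size_t ker_size) {
    assert(original != nullptr && col + 4 <= width);
    __m128 acc = _mm_setzero_ps();  // explicit zero start
    for (std::size_t row = 0; row < ker_size; ++row) {
        // Unaligned load: valid for any float address.
        acc = _mm_add_ps(acc, _mm_loadu_ps(&original[row * width + col]));
    }
    return acc;
}
```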
Hi,
I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
Do you think the use of vector operators could give bigger speedups?
Thank...
(Sorry if this sounds like a rant, but it's a real question and I'd appreciate real answers)
I understand that since C is so old, it might not have made sense to add it back then (MMX didn't even exist). But since then there has been C99, and there is still no standard for SIMD variables (as far as I know).
By "SIMD variables", I m...
I am using the SSE extensions available on a Core 2 Duo processor (compiler: gcc 4.4.1). I see that there are 16 registers available, each of which is 128 bits long. Now, I can fit 4 integer values into a single register and 4 into another, and using intrinsics I can add them in one instruction. The obvious advantage is this way I re...
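The four-integer addition described above looks roughly like this with SSE2 intrinsics. The wrapper name `add4_epi32` is hypothetical, and unaligned loads and stores are used so the sketch makes no assumption about how the arrays are allocated.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Adds four 32-bit integers from a and b lane by lane and stores them in out.
void add4_epi32(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    assert(a && b && out);
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // One PADDD performs all four additions.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```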
Does anyone know of an SSE-enhanced version of libtiff? Even just an SSE-enhanced version of a CCITT Group4 encoder would do; I could do the work of sliding that one into libtiff myself. I only need to work with bitonal images.
Thank you
...
How do I use the NEON comparison instructions in general?
Here is a case where I want to use the greater-than-or-equal-to instruction.
Currently I have:
int x;
...
...
...
if(x >= 0)
{
    ....
}
In NEON, I would like to use x in the same way, just that x this time is a vector.
int32x4_t x;
...
...
...
if(vcgeq_s32(x, vdupq_n_s32(0))) // Wh...
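`vcgeq_s32` does not return a boolean. It returns a `uint32x4_t` mask with all bits set in each lane where the comparison holds, so it cannot drive a scalar `if` directly. A common pattern is to feed the mask to a bitwise select such as `vbslq_s32`. The sketch below keeps non-negative lanes and zeroes the rest; that operation and the name `keep_non_negative` are only an assumed stand-in for the elided branch body. The scalar branch exists for non-NEON hosts.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// out[i] = x[i] >= 0 ? x[i] : 0, for four lanes.
void keep_non_negative(const std::int32_t* x, std::int32_t* out) {
    assert(x != nullptr && out != nullptr);
#if defined(__ARM_NEON)
    const int32x4_t v = vld1q_s32(x);
    const int32x4_t zero = vdupq_n_s32(0);
    // Each lane of mask is 0xFFFFFFFF where v >= 0, else 0.
    const uint32x4_t mask = vcgeq_s32(v, zero);
    // Bitwise select: take v where mask is set, zero elsewhere.
    vst1q_s32(out, vbslq_s32(mask, v, zero));
#else
    // Scalar fallback with the same per-lane meaning.
    for (int i = 0; i < 4; ++i) out[i] = x[i] >= 0 ? x[i] : 0;
#endif
}
```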
I'd like to optimize the following snippet using SSE instructions if possible:
/*
 * the data structure
 */
typedef struct v3d v3d;
struct v3d {
    double x;
    double y;
    double z;
} tmp = { 1.0, 2.0, 3.0 };
/*
 * the part that should be "optimized"
 */
tmp.x /= 4.0;
tmp.y /= 4.0;
tmp.z /= 4.0;
Is this possible at all?
...
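It is possible in principle, as the sketch below suggests: an SSE2 register holds two doubles, so `x` and `y` can be divided together while `z` is handled as a scalar. For only three divisions the gain may well be negligible. The function name `divide_v3d` is hypothetical, and the struct is redeclared here so the sketch stands alone.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2

struct v3d {
    double x;
    double y;
    double z;
};

// Divides all three components by divisor: x and y in one SSE2 operation,
// z with ordinary scalar division.
void divide_v3d(v3d* v, double divisor) {
    assert(v != nullptr && divisor != 0.0);
    // x and y are adjacent doubles, so one unaligned load covers both.
    __m128d xy = _mm_loadu_pd(&v->x);
    xy = _mm_div_pd(xy, _mm_set1_pd(divisor));
    _mm_storeu_pd(&v->x, xy);
    v->z /= divisor;
}
```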