Hi all,
This is the first time I am posting a question on Stack Overflow, so please overlook any errors I may have made in formatting my question/code. Please do point them out to me, though, so I can be more careful.
I was trying to write some simple intrinsics routines for the addition of two 128-bit (containing 4 float variabl...
I'm involved in one of those challenges where you try to produce the smallest possible binary, so I'm building my program without the C or C++ run-time libraries (RTL). I don't link to the DLL version or the static version. I don't even #include the header files. I have this working fine.
Some RTL functions, like memset(), can be use...
Hi,
I wrote a simple program that uses SSE intrinsics to compute the inner product of two large vectors (100000 or more elements). The program compares the execution time of both versions: the inner product computed the conventional way and with intrinsics. Everything works fine until I insert (just for the fun of it) an inner loop befo...
I've been trying to figure out how to gain some improvement in my code in a couple of very crucial lines:
float x = a*b;
float y = c*d;
float z = e*f;
float w = g*h;
All of a, b, c, ... are floats.
I decided to look into using SSE, but can't seem to find any improvement. In fact, it turns out to be twice as slow. My SSE code is:
Vector4 abcd...
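For reference, the four independent products above map onto a single packed multiply. A minimal sketch, assuming the operands already sit in two float arrays (the helper name mul4 and the array layout are hypothetical, not taken from the question). If the values first have to be gathered from separate scalars, that packing step can easily cost more than the multiply saves, which may be one reason for a slowdown like the one described.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Hypothetical helper: out[i] = lhs[i] * rhs[i] for i = 0..3,
   i.e. {a*b, c*d, e*f, g*h} when lhs = {a,c,e,g} and rhs = {b,d,f,h}. */
static void mul4(const float lhs[4], const float rhs[4], float out[4])
{
    __m128 l = _mm_loadu_ps(lhs);          /* unaligned load of a, c, e, g */
    __m128 r = _mm_loadu_ps(rhs);          /* unaligned load of b, d, f, h */
    _mm_storeu_ps(out, _mm_mul_ps(l, r));  /* one packed multiply, four products */
}
```

The unaligned load/store variants are used so the sketch makes no alignment assumption about the caller's arrays.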
How can I replace the following 32-bit driver assembly with an intrinsic? I am porting my driver code to 64-bit:
_asm jmp short $+8
...
What are these data types for: __m64, __m128, __m256?
...
I'm trying to figure out how best to pre-calculate some sine and cosine values, store them in aligned blocks, and then use them later for SSE calculations:
At the beginning of my program, I create an object with member:
static __m128 *m_sincos;
then I initialize that member in the constructor:
m_sincos = (__m128*) _aligned_malloc(Bins...
Can someone recommend an open-source format conversion library, optimized for SSE and SSE2?
Formats for conversion: I420, YUY2, RGB (16-bit, 32-bit).
I found only the VirtualDub Kasumi library.
...
Hi,
I could not find any intrinsics for a simple xor operation.
See: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html
Is there really no way to use NEON instructions for this?
...
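For what it's worth, arm_neon.h names its bitwise exclusive-or intrinsics veor (for example veorq_u32), which makes them easy to miss when searching for "xor". A minimal sketch, assuming a compiler that defines __ARM_NEON when NEON is enabled; the helper name xor4_u32 is hypothetical, and the scalar branch is there only so the sketch also builds on non-ARM machines.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = a[i] ^ b[i] for four 32-bit lanes. */
static void xor4_u32(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
#if defined(__ARM_NEON)
    /* veorq_u32 is the NEON bitwise exclusive-or on a 128-bit register. */
    vst1q_u32(out, veorq_u32(vld1q_u32(a), vld1q_u32(b)));
#else
    /* Scalar fallback (assumption: only used when NEON is unavailable). */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] ^ b[i];
#endif
}
```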
I'm very new to SIMD/SSE and I'm trying to do some simple image filtering (blurring).
The code below filters each pixel of an 8-bit gray bitmap with a simple [1 2 1] weighting in the horizontal direction. I'm creating sums of 16 pixels at a time.
What seems very bad about this code, at least to me, is that there is a lot of insert/extract in...
Hi Guys,
How do I use the multiply-accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone explain which three parameters I have to pass to this function? I mean, which are the source and destination registers, and what does the function return?
Help!!!
...
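As a rough illustration of the semantics: vmlaq_f32(a, b, c) returns a + b * c per lane, so the first argument is the accumulator and the return value is the new accumulated vector (none of the arguments is modified). A minimal sketch, assuming a compiler that defines __ARM_NEON; the helper name mla4 is hypothetical, and the scalar branch only mirrors the same arithmetic so the sketch builds elsewhere.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = acc[i] + b[i] * c[i] for four float lanes. */
static void mla4(const float acc[4], const float b[4], const float c[4],
                 float out[4])
{
#if defined(__ARM_NEON)
    /* First argument is the accumulator; the result is returned by value. */
    float32x4_t r = vmlaq_f32(vld1q_f32(acc), vld1q_f32(b), vld1q_f32(c));
    vst1q_f32(out, r);
#else
    /* Scalar fallback (assumption: only used when NEON is unavailable). */
    for (int i = 0; i < 4; ++i)
        out[i] = acc[i] + b[i] * c[i];
#endif
}
```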
I have a matrix
A = a1 a2 a3 a4
b1 b2 b3 b4
c1 c2 c3 c4
d1 d2 d3 d4
I have 2 rows with me,
float32x2_t a = a1 a2
float32x2_t b = b1 b2
From these, how can I get
float32x4_t result = b1 a1 b2 a2
Is there any single NEON SIMD instruction which can merge these two rows?
Or how can I achieve this in as few steps as p...
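One possible approach is a zip followed by a combine: vzip_f32(b, a) yields the pairs {b1, a1} and {b2, a2}, and vcombine_f32 joins them into one 128-bit vector. A minimal sketch, assuming the leftmost element in the notation above is lane 0 and that the compiler defines __ARM_NEON; the helper name interleave_b_a is hypothetical, and the scalar branch exists only so the sketch builds on non-ARM machines.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out = { b[0], a[0], b[1], a[1] }. */
static void interleave_b_a(const float a[2], const float b[2], float out[4])
{
#if defined(__ARM_NEON)
    /* z.val[0] = {b[0], a[0]}, z.val[1] = {b[1], a[1]} */
    float32x2x2_t z = vzip_f32(vld1_f32(b), vld1_f32(a));
    vst1q_f32(out, vcombine_f32(z.val[0], z.val[1]));
#else
    /* Scalar fallback (assumption: only used when NEON is unavailable). */
    out[0] = b[0];
    out[1] = a[0];
    out[2] = b[1];
    out[3] = a[1];
#endif
}
```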
Many SSE instructions allow the source operand to be a 16-byte aligned memory address. For example, the various (un)pack instructions. PUNPCKLBW has the following signature:
PUNPCKLBW xmm1, xmm2/m128
Now this doesn't seem to be possible at all with intrinsics. It looks like it's mandatory to use _mm_load* intrinsics to read anything...
I'm trying to compile some code that uses the intrinsic _mm_set_epi64x under Visual C++. This intrinsic is supported by VC but only when compiling for x86-64, not for x86-32. I assume this is not an actual limitation of the processor, because other compilers (GCC and Clang) support this intrinsic for both 32 and 64 bit compiles.
My firs...
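One possible workaround (not necessarily the approach the question goes on to describe) is to build the same value from four 32-bit pieces with _mm_set_epi32, which is available in both 32-bit and 64-bit builds. A minimal sketch; the helper name set_epi64x_compat is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_set_epi32 */
#include <stdint.h>

/* Hypothetical stand-in for _mm_set_epi64x(hi, lo): splits each 64-bit
   value into its upper and lower 32 bits. _mm_set_epi32 takes its
   arguments from the most significant element down to the least. */
static __m128i set_epi64x_compat(int64_t hi, int64_t lo)
{
    return _mm_set_epi32((int)((uint64_t)hi >> 32), (int)(uint32_t)hi,
                         (int)((uint64_t)lo >> 32), (int)(uint32_t)lo);
}
```

The casts through unsigned types avoid shifting a negative signed value; the final conversion back to int relies on the usual two's-complement wrap-around that x86 compilers provide.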
I implemented a function called abs(). I get this error:
Intrinsic function, cannot be defined
What have I done wrong?
I'm using Visual Studio 2005.
...
Hi,
My image processing project works with grayscale images. My platform is an ARM Cortex-A8 processor, and I want to make use of NEON.
I have a grayscale image (consider the example below), and in my algorithm I have to add only the columns.
How can I load four 8-bit pixel values in parallel, which are uint8_t, as four uint32_t into...
I'm having some trouble using SSE4.1 intrinsics on hardware that (I think) supports it.
Can anyone tell me if I've missed something?
Building the following code on a MacBookPro5,4 (Penryn):
>g++ -msse sse4.cpp -S -o sse4.asm
#include <stdio.h>
#include <smmintrin.h>
int main ()
{
__m128 a, b;
const int mask = 0x55;
a.m1...
How do I use the NEON comparison instructions in general?
Here is a case where I want to use the greater-than-or-equal-to instruction.
Currently I have:
int x;
...
...
...
if(x >= 0)
{
....
}
In NEON, I would like to use x in the same way, except that this time x is a vector.
int32x4_t x;
...
...
...
if(vcgeq_s32(x, vdupq_n_s32(0))) // Wh...
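A note on the pattern above: vcgeq_s32 returns a uint32x4_t mask (all ones in lanes where the comparison holds, zero elsewhere), so it cannot serve directly as an if condition. A common alternative is to compute both outcomes and pick per lane with vbslq_s32. A minimal sketch, assuming the body of the scalar if can be expressed as a per-lane select and that the compiler defines __ARM_NEON; the helper name clamp_negative_to_zero and the chosen operation are hypothetical, and the scalar branch is there only so the sketch builds elsewhere.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = (x[i] >= 0) ? x[i] : 0 for four lanes. */
static void clamp_negative_to_zero(const int32_t x[4], int32_t out[4])
{
#if defined(__ARM_NEON)
    int32x4_t  v    = vld1q_s32(x);
    int32x4_t  zero = vdupq_n_s32(0);
    uint32x4_t mask = vcgeq_s32(v, zero);      /* all ones where v >= 0 */
    vst1q_s32(out, vbslq_s32(mask, v, zero));  /* take v where mask set, else 0 */
#else
    /* Scalar fallback (assumption: only used when NEON is unavailable). */
    for (int i = 0; i < 4; ++i)
        out[i] = (x[i] >= 0) ? x[i] : 0;
#endif
}
```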
I have recently started using Neon intrinsics in my iOS image convolution code and have a shaky grasp at best. Right now, I get to the pixel data from CGBitmapContextGetData (cgctx); but I would like to take advantage of de-interleaving using vld4 (ARGB data). What is the best way to do this? I'm sure it's one of those simple things I ...
I'm very new to SSE and have optimized a section of code using intrinsics. I'm pleased with the operation itself, but I'm looking for a better way to write the result. The results end up in three __m128i variables.
What I'm trying to do is store specific bytes from the result values to non-contiguous memory locations. I'm currently doin...