intrinsics

Intrinsics program (SSE) - g++ - help needed

Hi all, This is the first time I am posting a question on stackoverflow, so please try and overlook any errors I may have made in formatting my question/code. But please do point the same out to me so I may be more careful. I was trying to write some simple intrinsics routines for the addition of two 128-bit (containing 4 float variabl...
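For readers unfamiliar with the kind of routine this question describes, here is a minimal sketch in C of adding two 128-bit vectors of four floats with SSE. The helper name add4 is hypothetical and is not taken from the question; unaligned loads and stores are used so the sketch makes no assumption about how the caller allocated its arrays.

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3.
 * Unaligned load/store variants are used, so the pointers do not
 * need to be 16-byte aligned. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats from a */
    __m128 vb = _mm_loadu_ps(b);             /* load 4 floats from b */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* lane-wise add, then store */
}
```

With g++ or gcc on x86, this typically needs -msse (or a 64-bit target, where SSE is enabled by default).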

How to use VC++ intrinsic functions w/o run-time library

I'm involved in one of those challenges where you try to produce the smallest possible binary, so I'm building my program without the C or C++ run-time libraries (RTL). I don't link to the DLL version or the static version. I don't even #include the header files. I have this working fine. Some RTL functions, like memset(), can be use...

g++ SSE intrinsics dilemma - value from intrinsic "saturates"

Hi, I wrote a simple program to implement SSE intrinsics for computing the inner product of two large (100000 or more elements) vectors. The program compares the execution time for both, inner product computed the conventional way and using intrinsics. Everything works out fine, until I insert (just for the fun of it) an inner loop befo...
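As an illustration of the technique the question refers to, the following is a rough sketch of an SSE inner product in C. It is not the asker's code (the excerpt is truncated) and it does not attempt to diagnose the "saturation" issue; the function name dot_sse and the scalar tail loop are assumptions made for the example.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_setzero_ps, _mm_loadu_ps, _mm_mul_ps, _mm_add_ps */

/* Hypothetical helper: returns sum of x[i] * y[i] for i in 0..n-1. */
float dot_sse(const float *x, const float *y, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* four partial sums, one per lane */
    size_t i = 0;

    /* Process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i));
        acc = _mm_add_ps(acc, prod);
    }

    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because the vector version accumulates four partial sums and adds them at the end, its floating-point result can differ slightly from a strictly sequential scalar loop on large inputs.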

Help with Assembly/SSE Multiplication

I've been trying to figure out how to gain some improvement in my code in a couple of very crucial lines: float x = a*b; float y = c*d; float z = e*f; float w = g*h; All of a, b, c... are floats. I decided to look into using SSE, but can't seem to find any improvement; in fact, it turns out to be twice as slow. My SSE code is: Vector4 abcd...

replace _asm with intrinsic equivalent

How can I replace the following 32-bit driver assembly to intrinsic as I am porting over my driver code to 64-bit: _asm jmp short $+8 ...

Data types for x86-64 processors

What are these data types for? __m64, __m128, __m256 ? ...

Why does my data not seem to be aligned?

I'm trying to figure out how to best pre-calculate some sine and cosine values, store them in aligned blocks, and then use them later for SSE calculations: At the beginning of my program, I create an object with member: static __m128 *m_sincos; then I initialize that member in the constructor: m_sincos = (__m128*) _aligned_malloc(Bins...

Fast format conversion open source library

Can someone recommend an open source format conversion library, optimized for SSE and SSE2? Formats for conversion: I420, YUY2, RGB (16-bit, 32-bit). I found only the VirtualDub Kasumi library. ...

No xor gcc intrinsics for ARM NEON

Hi, I could not find any intrinsics for a simple xor operation. See: http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html Is there really no way to use NEON instructions for this? ...

SIMD/SSE newbie: simple image filtering

I'm very new to SIMD/SSE and I'm trying to do some simple image filtering (blurring). The code below filters each pixel of an 8-bit gray bitmap with a simple [1 2 1] weighting in the horizontal direction. I'm creating sums of 16 pixels at a time. What seems very bad about this code, at least to me, is that there is a lot of insert/extract in...

How to use the multiply and accumulate intrinsics in ARM Cortex-A8?

Hi guys, how do I use the multiply-accumulate intrinsics provided by GCC? float32x4_t vmlaq_f32 (float32x4_t, float32x4_t, float32x4_t); Can anyone explain which three parameters I have to pass to this function? I mean the source and destination registers, and what the function returns. Help!!! ...

How to merge elements of 2 rows using NEON SIMD?

I have a matrix A = [a1 a2 a3 a4; b1 b2 b3 b4; c1 c2 c3 c4; d1 d2 d3 d4]. I have 2 rows with me: float32x2_t a = a1 a2 and float32x2_t b = b1 b2. From these, how can I get float32x4_t result = b1 a1 b2 a2? Is there any single NEON SIMD instruction which can merge these two rows? Or how can I achieve this using as minimum steps as p...

SSE2 intrinsics: access memory directly

Many SSE instructions allow the source operand to be a 16-byte aligned memory address, for example the various (un)pack instructions. PUNPCKLBW has the following signature: PUNPCKLBW xmm1, xmm2/m128. Now this doesn't seem to be possible at all with intrinsics. It looks like it's mandatory to use _mm_load* intrinsics to read anything...

What can be used to replace _mm_set_epi64x on 32-bit Windows?

I'm trying to compile some code that uses the intrinsic _mm_set_epi64x under Visual C++. This intrinsic is supported by VC but only when compiling for x86-64, not for x86-32. I assume this is not an actual limitation of the processor, because other compilers (GCC and Clang) support this intrinsic for both 32 and 64 bit compiles. My firs...
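One workaround sometimes used for this situation is to build the same 128-bit value from four 32-bit halves with _mm_set_epi32, which is an SSE2 intrinsic that does not depend on 64-bit integer registers. The sketch below is an assumption about what would fit the asker's case, not something confirmed by the truncated excerpt, and the name set_epi64x_compat is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: __m128i, _mm_set_epi32 */
#include <stdint.h>

/* Hypothetical stand-in for _mm_set_epi64x(hi, lo):
 * bits 127..64 of the result hold hi, bits 63..0 hold lo.
 * _mm_set_epi32 takes its arguments from the highest 32-bit lane
 * down to the lowest. */
__m128i set_epi64x_compat(int64_t hi, int64_t lo)
{
    uint64_t uhi = (uint64_t)hi;
    uint64_t ulo = (uint64_t)lo;
    return _mm_set_epi32((int)(uint32_t)(uhi >> 32),  /* upper half of hi */
                         (int)(uint32_t)uhi,          /* lower half of hi */
                         (int)(uint32_t)(ulo >> 32),  /* upper half of lo */
                         (int)(uint32_t)ulo);         /* lower half of lo */
}
```

The casts go through unsigned types so the shifts are well defined for negative inputs; the final conversion to int relies on the usual two's-complement behavior of mainstream x86 compilers.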

Intrinsic function, cannot be defined (C)

I implemented a function called abs(). I get this error: Intrinsic function, cannot be defined What have I done wrong? I'm using Visual Studio 2005. ...

ARM NEON: How to load 8bit uint8_t as uint32_t?

Hi, my image processing project works with grayscale images. I have an ARM Cortex-A8 processor platform and want to make use of NEON. I have a grayscale image (consider the example below) and in my algorithm, I have to add only the columns. How can I load four 8-bit pixel values in parallel, which are uint8_t, as four uint32_t into...

SSE4.1 intrinsics compilation error on Mac

I'm having some trouble using SSE4.1 intrinsics on hardware that (I think) supports it. Can anyone tell me if I've missed something? Building the following code on a MacBookPro5,4 (Penryn): >g++ -msse sse4.cpp -S -o sse4.asm #include <stdio.h> #include <smmintrin.h> int main () { __m128 a, b; const int mask = 0x55; a.m1...

How to use NEON comparison (greater than or equal to) instruction?

How to use the NEON comparison instructions in general? Here is a case, I want to use, Greater-than-or-equal-to instruction? Currently I have a, int x; ... ... ... if(x >= 0) { .... } In NEON, I would like to use x in the same way, just that x this time is a vector. int32x4_t x; ... ... ... if(vcgeq_s32(x, vdupq_n_s32(0))) // Wh...

Neon Intrinsics in iOS

I have recently started using Neon intrinsics in my iOS image convolution code and have a shaky grasp at best. Right now, I get to the pixel data from CGBitmapContextGetData (cgctx); but I would like to take advantage of de-interleaving using vld4 (ARGB data). What is the best way to do this? I'm sure it's one of those simple things I ...

How To Store Values In Non-Contiguous Memory Locations With SSE Intrinsics?

I'm very new to SSE and have optimized a section of code using intrinsics. I'm pleased with the operation itself, but I'm looking for a better way to write the result. The results end up in three __m128i variables. What I'm trying to do is store specific bytes from the result values to non-contiguous memory locations. I'm currently doin...