sse

SIMD/SSE newbie: simple image filtering

I'm very new to SIMD/SSE and I'm trying to do some simple image filtering (blurring). The code below filters each pixel of an 8-bit gray bitmap with a simple [1 2 1] weighting in the horizontal direction. I'm creating sums of 16 pixels at a time. What seems very bad about this code, at least to me, is that there is a lot of insert/extract in...

How to convert a hex float to a float in C/C++ using the _mm_extract_ps SSE GCC intrinsic function

Hi, I'm writing SSE code for 2D convolution, but the SSE documentation is very sparse. I'm calculating a dot product with _mm_dp_ps and using _mm_extract_ps to get the result, but _mm_extract_ps returns a hex value that represents a float and I can't figure out how to convert this hex float to a regular float. I could use __builtin_ia3...
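A minimal sketch of one way around this, assuming the "hex" value is the raw 32-bit pattern of the float (`_mm_extract_ps` is declared as returning `int`). The helper names `float_from_bits`, `lane0` and `lane3` are made up for illustration. The code deliberately avoids SSE4.1 intrinsics so it builds without extra flags; with SSE4.1 enabled, the first helper could be called as `float_from_bits(_mm_extract_ps(v, 0))`.

```c
#include <assert.h>
#include <string.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_cvtss_f32, _mm_shuffle_ps */

/* Reinterpret the 32 bits of an int as a float. This is a bit copy,
   not a numeric conversion: 0x3F800000 becomes 1.0f, not 1065353216.0f. */
static float float_from_bits(int bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Read lane 0 directly as a float, without going through an int. */
static float lane0(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Read lane 3 by broadcasting it to every lane first. */
static float lane3(__m128 v)
{
    return _mm_cvtss_f32(_mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3)));
}
```

If the value of interest is the dot product in the lowest lane, `lane0` alone is probably enough and no bit reinterpretation is needed.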

Complex Mul and Div using SSE instructions

Is performing complex multiplication and division with SSE instructions beneficial? I know that addition and subtraction perform better with SSE. Can someone tell me how to perform complex multiplication to get better performance? ...

SSE Alignment with class

I'm having a really weird problem, and as a beginner with C++ I don't know why. struct DeviceSettings { public: ....somevariables DXSize BackbufferSize; ....somemethods }; struct DXPoint; typedef DXPoint DXSize; __declspec(align(16)) struct DXPoint { public: union { struct { int x; int ...

Ensure compiler always uses SSE sqrt instruction

I'm trying to get GCC (or clang) to consistently use the SSE instruction for sqrt instead of the math library function for a computationally intensive scientific application. I've tried a variety of GCCs on various 32 and 64 bit OS X and Linux systems. I'm making sure to enable sse with -mfpmath=sse (and -march=core2 to satisfy GCCs requ...

SSE2 intrinsics: access memory directly

Many SSE instructions allow the source operand to be a 16-byte aligned memory address. For example, the various (un)pack instructions. PUNPCKLBW has the following signature: PUNPCKLBW xmm1, xmm2/m128 Now this doesn't seem to be possible at all with intrinsics. It looks like it's mandatory to use _mm_load* intrinsics to read anything...

Flipping sign on packed SSE floats.

I'm looking for the most efficient method of flipping the sign on all four floats packed in an SSE register. I have not found an intrinsic for doing this in the Intel Architecture software dev manual. Below are the things I've already tried. For each case I looped over the code 10 billion times and got the wall-time indicated. I'm ...
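One commonly used approach, sketched here under the assumption that plain SSE is available: XOR every lane with a mask that has only the sign bit set. The helper names `negate_ps` and `negate4` are hypothetical, and no timing claims are made for this version.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_xor_ps, _mm_set1_ps */

/* Negate four packed floats. -0.0f has only the sign bit set, so the
   XOR toggles the sign and leaves exponent and mantissa untouched. */
static __m128 negate_ps(__m128 v)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(v, sign_mask);
}

/* Hypothetical wrapper for plain arrays; uses unaligned loads and
   stores so the pointers need no particular alignment. */
static void negate4(const float *in, float *out)
{
    assert(in != NULL && out != NULL);
    _mm_storeu_ps(out, negate_ps(_mm_loadu_ps(in)));
}
```

Inside a loop, the mask would normally be hoisted out so that each iteration costs a single XOR.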

Compute the absolute difference between unsigned integers using SSE

In C is there a branch-less technique to compute the absolute difference between two unsigned ints? For example given the variables a and b, I would like the value 2 for cases when a=3, b=5 or b=3, a=5. Ideally I would also like to be able to vectorize the computation using the SSE registers. ...
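A possible sketch, assuming 32-bit unsigned values and SSE2 only. The function names are made up for illustration. The scalar version builds a mask from the comparison and conditionally negates the wrapped difference; the vector version does the same, using a sign-bit flip because SSE2 only offers signed 32-bit compares.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Scalar: |a - b| without an explicit branch in the source. */
static uint32_t abs_diff_u32(uint32_t a, uint32_t b)
{
    uint32_t d = a - b;                        /* wraps modulo 2^32 if a < b */
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < b); /* all ones if a < b */
    return (d ^ mask) - mask;                  /* negates d when mask is set */
}

/* Vector: four unsigned 32-bit lanes at once. */
static __m128i abs_diff_epu32(__m128i a, __m128i b)
{
    /* XOR with 0x80000000 maps unsigned order onto signed order. */
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    __m128i lt = _mm_cmpgt_epi32(_mm_xor_si128(b, bias),
                                 _mm_xor_si128(a, bias)); /* a < b */
    __m128i d = _mm_sub_epi32(a, b);
    return _mm_sub_epi32(_mm_xor_si128(d, lt), lt);
}

/* Hypothetical wrapper for plain arrays of four values. */
static void abs_diff4(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, abs_diff_epu32(va, vb));
}
```

Whether the scalar comparison compiles to branch-free machine code depends on the compiler; checking the generated assembly is the only way to be sure.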

SSE access violation

Hi. I have the code: float *mu_x_ptr; __m128 *tmp; __m128 *mm_mu_x; mu_x_ptr = _aligned_malloc(4*sizeof(float), 16); mm_mu_x = (__m128*) mu_x_ptr; for(row = 0; row < ker_size; row++) { tmp = (__m128*) &original[row*width + col]; *mm_mu_x = _mm_add_ps(*tmp, *mm_mu_x); } From this I get: First-chance exception at 0x00ad192e i...
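The excerpt is cut off, so the cause can only be guessed at. One plausible culprit is that dereferencing a `__m128*` requires a 16-byte aligned address, which `&original[row*width + col]` does not guarantee. A sketch that sidesteps this with unaligned loads and a zero-initialized accumulator follows; the function name `sum_rows4` and its signature are assumptions, not the asker's code.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Sum four adjacent columns (col .. col+3) over ker_size rows.
   _mm_loadu_ps accepts any address, unlike a dereferenced __m128*. */
static void sum_rows4(const float *original, int width, int col,
                      int ker_size, float *out)
{
    assert(original != NULL && out != NULL && col + 4 <= width);
    __m128 acc = _mm_setzero_ps();  /* start from zero, not stale memory */
    for (int row = 0; row < ker_size; row++) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(&original[row * width + col]));
    }
    _mm_storeu_ps(out, acc);
}
```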

64 bits binary form in assembler....

I'm kind of making a "JIT" for a numeric routine that I need to compute fast, for x86-64. I'm only using SSE instructions for arithmetic and of course some moves. My application generates all of those by simply writing the binary form of machine instructions to some part of memory and then executing. For getting the binary form of ins...

SIMD Programming

I am using SSE extensions available in Core2Duo processor (compiler gcc 4.4.1). I see that there are 16 registers available each of which is 128 bit long. Now, I can accommodate 4 integer values into a single register, and 4 in another register and using intrinsics I can add them in one instruction. The obvious advantage is this way I re...

SSE enhanced libtiff/CCITT Fax4 encoder

Does anyone know of an SSE enhanced version of libtiff? Even just an SSE enhanced version of a CCITT Group4 encoder would do, I could do the work of sliding that one in libtiff myself. I only need to work with bitonal images. Thank you ...

sse inline assembly with g++

I'm trying out g++ inline assembly and sse and wrote a first program. It segfaults - why? #include <stdio.h> float s[128*4] __attribute__((aligned(16))); #define r0 3 #define r1 17 #define r2 110 #define rs0 "3" #define rs1 "17" #define rs2 "110" int main () { s[r0*4+0] = 2.0; s[r0*4+1] = 3.0; s[r0*4+2] = 4.0; s[r0*4+3] = 5.0; ...

SSE4.1 intrinsics compilation error on Mac

I'm having some trouble using SSE4.1 intrinsics on hardware that (I think) supports it. Can anyone tell me if I've missed something? Building the following code on a MacBookPro5,4 (Penryn): >g++ -msse sse4.cpp -S -o sse4.asm #include <stdio.h> #include <smmintrin.h> int main () { __m128 a, b; const int mask = 0x55; a.m1...

SIMD (SSE) instruction for division in GCC

I'd like to optimize the following snippet using SSE instructions if possible: /* * the data structure */ typedef struct v3d v3d; struct v3d { double x; double y; double z; } tmp = { 1.0, 2.0, 3.0 }; /* * the part that should be "optimized" */ tmp.x /= 4.0; tmp.y /= 4.0; tmp.z /= 4.0; Is this possible at all? ...
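A minimal sketch of what an explicit SSE2 version could look like, assuming the three doubles are laid out contiguously as in the struct from the question. The function name `v3d_div` is hypothetical. With three doubles, one packed operation handles x and y and a scalar operation handles z.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_div_pd, _mm_div_sd */

typedef struct v3d v3d;
struct v3d {
    double x;
    double y;
    double z;
};

/* Divide all three components by the same divisor. */
static void v3d_div(v3d *v, double divisor)
{
    assert(v != NULL);
    __m128d d = _mm_set1_pd(divisor);

    /* x and y are adjacent doubles, so one unaligned 128-bit load covers both. */
    __m128d xy = _mm_loadu_pd(&v->x);
    _mm_storeu_pd(&v->x, _mm_div_pd(xy, d));

    /* z is handled with the scalar form of the same instruction. */
    __m128d z = _mm_load_sd(&v->z);
    _mm_store_sd(&v->z, _mm_div_sd(z, d));
}
```

On x86-64 compilers already use SSE2 for scalar double arithmetic, so the gain from hand-written intrinsics on a single three-element vector is likely to be small; multiplying by 0.25 instead of dividing by 4.0 may matter more.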

An SSE Stdlib-esque Library?

Generally everything I come across 'on-the-net' with relation to SSE/MMX comes out as maths stuff for vectors and matrices. However, I'm looking for libraries of SSE optimized 'standard functions', like those provided by Agner Fog, or some of the SSE based string scanning algorithms in GCC. As a quick general rundown: these would be th...

DPPS on older version GCC

Hi! I need to optimize some matrix multiplication code in C, and I'm doing it using SSE vector instructions. I also found that SSE4.1 already has an instruction for the dot product, dpps. The problem is that on the machine where this software is supposed to run, there is an old version of gcc installed (4.1.2), which has no support f...

How To Store Values In Non-Contiguous Memory Locations With SSE Intrinsics?

I'm very new to SSE and have optimized a section of code using intrinsics. I'm pleased with the operation itself, but I'm looking for a better way to write the result. The results end up in three __m128i variables. What I'm trying to do is store specific bytes from the result values to non-contiguous memory locations. I'm currently doin...

SSE3 instructions in F#

How do I parallelize my F# program using the SSE3 instruction set? Does the F# compiler support it? ...

What is my compiler doing? (optimizing memcpy)

I'm compiling a bit of code using the following settings in VC++2010: /O2 /Ob2 /Oi /Ot However, I'm having some trouble understanding some parts of the generated assembly; I have put some questions in the code as comments. Also, what prefetching distance is generally recommended on modern CPUs? I can of course test on my own CPU, but I was h...