sse

error A2070: invalid instruction operands IN SSE MASM64

When compiling this with ml64.exe (64-bit MASM), the SSE instructions give me an error. What do I need to do to use SSE instructions in 64-bit code? .code foo PROC movlps [rdx], xmm7 ;;error A2070: invalid instruction operands movhlps xmm6, xmm7 movss [rdx+8], xmm6 ;;error A2070: invalid instruction operands ret foo ENDP end i get t...

transpose for 8 registers of 16-bit elements on SSE2/SSSE3

(I'm a newbie to SSE/asm, apologies if this is obvious or redundant) Is there a better way to transpose 8 SSE registers containing 16-bit values than performing 24 unpck[lh]ps and 8/16+ shuffles and using 8 extra registers? (Note: using only instructions up to SSSE3, Intel Merom, i.e. lacking BLEND* from SSE4.) Say you have registers v[0-7] ...

Intrinsics Vs inline ASM for SSE coding in VC++ 2K8

I've done some inline ASM coding for SSE before and it was not too hard even for someone who doesn't know ASM. But I note MS also provide intrinsics wrapping many such special instructions. Is there a particular performance difference, or any other strong reason why one should be used above the other? To repeat from the title, this is ...

OpenMP + SSE gives no speedup

Hi, My professor found this interesting experiment on 3D linearly separable kernel convolution using SSE and OpenMP, and gave me the task of benchmarking it on our system. The author claims a crazy 18-fold speedup over the serial approach! That might not always hold, but we were expecting at least a 2-4 times speedup running th...

How much effort do you have to put in to get gains from using SSE?

Case One Say you have a little class: class Point3D { private: float x,y,z; public: operator+=() ...etc }; Point3D &Point3D::operator+=(Point3D &other) { this->x += other.x; this->y += other.y; this->z += other.z; } A naive use of SSE would simply replace these function bodies with using a few intrinsics. But would we e...

Can one construct a "good" hash function using CRC32C as a base?

Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that? Update How about this? Only 16 ...

GCC - How to realign stack?

I am trying to build an application which uses pthreads and the __m128 SSE type. According to the GCC manual, the default stack alignment is 16 bytes. In order to use __m128, 16-byte alignment is required. My target CPU supports SSE. I use a GCC compiler which doesn't support runtime stack realignment (e.g. -mstackrealign). I cannot use any ot...

How do I enable the SSE3/SSE4.1 instruction set in Visual Studio 2008?

I tried to follow: Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set But the only options I got were - SSE or SSE2. Thanks. ...

What's the difference between logical SSE intrinsics?

Hello, Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions: Is there any difference between using one or another intrinsic (wi...
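
As a rough illustration of the question above, the following sketch applies all three OR intrinsics to the same 128 bits and compares the resulting bit patterns. The function name `or_variants_agree` is hypothetical, and the sketch assumes an x86 target with SSE2. It only checks that the computed bits match; any performance difference between the integer and floating-point forms depends on the microarchitecture and is not measured here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: integer ops and the _mm_cast* reinterpretations */

/* Hypothetical helper: returns 1 if _mm_or_ps, _mm_or_pd and _mm_or_si128
   produce the same 128-bit result for the given inputs, 0 otherwise. */
int or_variants_agree(const uint32_t a[4], const uint32_t b[4])
{
    assert(a != NULL && b != NULL);

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Integer form. */
    __m128i r_int = _mm_or_si128(va, vb);

    /* Single- and double-precision forms; the casts only reinterpret bits. */
    __m128i r_ps = _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(va), _mm_castsi128_ps(vb)));
    __m128i r_pd = _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(va), _mm_castsi128_pd(vb)));

    /* Byte-wise equality: all 16 mask bits set means identical results. */
    int same_ps = _mm_movemask_epi8(_mm_cmpeq_epi8(r_int, r_ps)) == 0xFFFF;
    int same_pd = _mm_movemask_epi8(_mm_cmpeq_epi8(r_int, r_pd)) == 0xFFFF;
    return same_ps && same_pd;
}
```

Calling `or_variants_agree` with arbitrary bit patterns, including ones that would be NaN when read as floats, should return 1, since the operation is purely bitwise.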

Benefit of using multiple SIMD instruction sets simultaneously

I'm writing a highly parallel application that's multithreaded. I've already got an SSE accelerated thread class written. If I were to write an MMX accelerated thread class, then run both at the same time (one SSE thread and one MMX thread per core) would the performance improve noticeably? I would think that this setup would help hide...

C/C++ usage of special CPU features

Hi, I am curious: do new compilers use extra features built into new CPUs, such as MMX, SSE, 3DNow! and so on? I mean, the original 8086 did not even have an FPU, so a compiler that old cannot use one, but new compilers can, since an FPU is part of every new CPU. So, do new compilers use the new features of the CPU? Or, it should be more right...

Intrinsics program (SSE) - g++ - help needed

Hi all, This is the first time I am posting a question on stackoverflow, so please try and overlook any errors I may have made in formatting my question/code. But please do point the same out to me so I may be more careful. I was trying to write some simple intrinsics routines for the addition of two 128-bit (containing 4 float variabl...

SSE SIMD Optimization For Loop

I have some code in a loop for(int i = 0; i < n; i++) { u[i] = c * u[i] + s * b[i]; } So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup? UPDATE I learnt vectorization (turns out it's not so hard if you use intrinsics) and ...
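
A minimal sketch of how the loop in this question could be vectorized with SSE intrinsics, assuming `u` and `b` are `float` arrays that do not overlap (the excerpt does not state the element type). The function name `scale_add_sse` is hypothetical. Unaligned loads and stores are used so that no alignment assumption is needed, and a scalar loop handles the leftover elements when `n` is not a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_*_ps intrinsics */

/* Hypothetical helper: u[i] = c * u[i] + s * b[i] for i in [0, n). */
void scale_add_sse(float *u, const float *b, float c, float s, size_t n)
{
    assert(u != NULL && b != NULL);

    /* Broadcast each scalar into all four lanes. */
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);

    size_t i = 0;
    /* Main loop: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vu = _mm_loadu_ps(u + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 r = _mm_add_ps(_mm_mul_ps(vc, vu), _mm_mul_ps(vs, vb));
        _mm_storeu_ps(u + i, r);
    }
    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        u[i] = c * u[i] + s * b[i];
    }
}
```

Whether this beats the compiler's own auto-vectorized version of the plain loop depends on the compiler, flags, and array sizes, so it is worth benchmarking both.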

Need some constructive criticism on my SSE/Assembly attempt

Hello, I'm working on converting a bit of code to SSE, and while I have the correct output, it turns out to be slower than standard C++ code. The bit of code that I need to do this for is: float ox = p2x - (px * c - py * s)*m; float oy = p2y - (px * s - py * c)*m; What I've got for SSE code is: void assemblycalc(vector4 &p, vector4 &...

g++ SSE intrinsics dilemma - value from intrinsic "saturates"

Hi, I wrote a simple program that uses SSE intrinsics to compute the inner product of two large (100000 or more elements) vectors. The program compares the execution time of the inner product computed the conventional way with the one computed using intrinsics. Everything works out fine, until I insert (just for the fun of it) an inner loop befo...

Concise SSE and MMX instruction reference with latencies and throughput

I am trying to optimize some arithmetic by using the MMX and SSE instruction sets with inline assembly. However, I have been unable to find good references for the timings and usages of these enhanced instruction sets. Could you please help me find references that contain information about the throughput, latency, operands, and perhaps s...

Fast 24-bit array -> 32-bit array conversion?

Quick Summary: I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements? Details: I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixel...

Fast double -> short conversion with clamping using SSE?

Is there a fast way to cast double values to shorts (16-bit signed)? Currently I'm doing something like this: double dval = <sum junk>; int16_t sval; if (dval > int16_max) { sval = int16_max; } else if (dval < int16_min) { sval = int16_min; } else sval = (int16_t)dval; I suspect there's a fast way to do this using SSE that wi...
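
One possible SSE2 approach to the clamped conversion described above, sketched under a few assumptions: the function name `clamp_doubles_to_int16` is hypothetical, the input contains no NaN values, and truncation toward zero (as with the C cast) is the desired rounding. The values are clamped in the double domain first, because an out-of-range double converts to the integer indefinite value 0x80000000, which would otherwise saturate to -32768 even for large positive inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: double-precision and integer intrinsics */

/* Hypothetical helper: dst[i] = src[i] clamped to [-32768, 32767],
   then truncated toward zero. Assumes src contains no NaNs. */
void clamp_doubles_to_int16(const double *src, int16_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);

    const __m128d lo = _mm_set1_pd(-32768.0);
    const __m128d hi = _mm_set1_pd(32767.0);

    size_t i = 0;
    /* Main loop: four doubles in, four int16 values out. */
    for (; i + 4 <= n; i += 4) {
        __m128d a = _mm_loadu_pd(src + i);
        __m128d b = _mm_loadu_pd(src + i + 2);

        /* Clamp while still in double precision. */
        a = _mm_max_pd(_mm_min_pd(a, hi), lo);
        b = _mm_max_pd(_mm_min_pd(b, hi), lo);

        /* Truncating conversion: two int32 results in the low 64 bits. */
        __m128i ia = _mm_cvttpd_epi32(a);
        __m128i ib = _mm_cvttpd_epi32(b);

        /* Combine to a0 a1 b0 b1, then narrow int32 to int16. */
        __m128i i32 = _mm_unpacklo_epi64(ia, ib);
        __m128i i16 = _mm_packs_epi32(i32, i32);

        /* Store only the low 64 bits (four int16 values). */
        _mm_storel_epi64((__m128i *)(dst + i), i16);
    }
    /* Scalar tail, same logic as the original snippet. */
    for (; i < n; i++) {
        double v = src[i];
        if (v > 32767.0) v = 32767.0;
        else if (v < -32768.0) v = -32768.0;
        dst[i] = (int16_t)v;
    }
}
```

If round-to-nearest is preferred over truncation, `_mm_cvtpd_epi32` could be swapped in for `_mm_cvttpd_epi32`; it rounds according to the current MXCSR rounding mode.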

How to use SSE with both Windows compiler and GCC compiler?

I have to optimize a piece of code using SSE extensions. My target platforms are Windows and Linux, so I build my application using the MS compiler (Visual Studio) and GCC. What approaches exist for using SSE? I can find a lot of examples of how to use SSE with GCC, but they seem to be incompatible with the MS compiler. Does exis...

Ruby SIMD & SSE

I'm wondering if there is a way to extend the Ruby Array type to do SIMD & SSE vector calculations. I mean something implemented in a low-level language that can be used from Ruby programs for number-crunching tasks. ...