sse2

SSE2 option in Visual C++ (x64)

I've added an x64 configuration to my C++ project to compile a 64-bit version of my app. Everything looks fine, but the compiler gives the following warning: `cl : Command line warning D9002 : ignoring unknown option '/arch:SSE2'` Is SSE2 optimization really not available for 64-bit projects? ...

What's the most efficient way to multiply 4 floats by 4 floats using SSE?

I currently have the following code: float a[4] = { 10, 20, 30, 40 }; float b[4] = { 0.1, 0.1, 0.1, 0.1 }; asm volatile("movups (%0), %%xmm0\n\t" "mulps (%1), %%xmm0\n\t" "movups %%xmm0, (%1)" :: "r" (a), "r" (b)); I have first of all a few questions: (1) if i WERE to a...
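For comparison with the inline-assembly version quoted above, here is a minimal sketch of the same four-float multiply written with compiler intrinsics. The function name `mul4` is hypothetical and not from the question. It uses unaligned loads and stores, so it makes no alignment assumptions about the arrays.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical helper: multiplies a[0..3] by b[0..3] element-wise into out[0..3].
// Unaligned load/store is used, so the pointers need no particular alignment.
inline void mul4(const float* a, const float* b, float* out) {
    const __m128 va = _mm_loadu_ps(a);       // load four floats from a
    const __m128 vb = _mm_loadu_ps(b);       // load four floats from b
    _mm_storeu_ps(out, _mm_mul_ps(va, vb));  // four multiplies in one instruction
}
```

With intrinsics the compiler handles register allocation and knows which memory is read and written, which the asm statement above does not declare.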

Call a function lower in the script from a function higher in the script

Hello, I'm trying to come up with a way to make the computer do some work for me. I'm using SIMD (SSE2 & SSE3) to calculate the cross product, and I was wondering if it could go any faster. Currently I have the following: const int maskShuffleCross1 = _MM_SHUFFLE(3,0,2,1); // y z x const int maskShuffleCross2 = _MM_SHUFFLE(3,1,0,2); //...
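The two shuffle masks quoted in the excerpt can be combined into a cross product roughly as follows. This is only a sketch: the function name `cross3` is hypothetical, and the rest of the asker's code is truncated, so it may differ from their version. Vectors are assumed to be stored as (x, y, z, w) with the fourth lane unused.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_mul_ps, _mm_sub_ps

// Hypothetical helper: cross product of the x, y, z lanes of a and b.
// The w lane of the result is a.w*b.w - a.w*b.w.
inline __m128 cross3(__m128 a, __m128 b) {
    const __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));  // (y, z, x, w)
    const __m128 a_zxy = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 1, 0, 2));  // (z, x, y, w)
    const __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 b_zxy = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 1, 0, 2));
    // (a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x, ...)
    return _mm_sub_ps(_mm_mul_ps(a_yzx, b_zxy), _mm_mul_ps(a_zxy, b_yzx));
}
```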

Giving an instance of a class a pointer to a struct

I am trying to get SSE functionality in my vector class (I've rewritten it three times so far. :\) and I'm doing the following: #ifndef _POINT_FINAL_H_ #define _POINT_FINAL_H_ #include "math.h" namespace Vector3D { #define SSE_VERSION 3 #if SSE_VERSION >= 2 #include <emmintrin.h> // SSE2 #if SSE_VERSION >= 3 #inc...

SSE2 Compiler Error

I'm trying to break into SSE2 and tried the following example program: #include "stdafx.h" #include <emmintrin.h> int main(int argc, char* argv[]) { __declspec(align(16)) long mul; // multiply variable __declspec(align(16)) int t1[100000]; // temporary variable __declspec(align(16)) int t2[100000]; // temporary variable __m128i mul...

SSE2 - 16-byte aligned dynamic allocation of memory

EDIT: This is a followup to SSE2 Compiler Error This is the real bug I experienced before and have reproduced below by changing the _mm_malloc statement as Michael Burr suggested: Unhandled exception at 0x00415116 in SO.exe: 0xC0000005: Access violation reading location 0xffffffff. At line label: movdqa xmm0, xmmword ptr [t1+...

Add the upper and lower 64-bits of a 128-bit xmm register

I have two packed quadword integers in xmm0 and I need to add them together and store the result in a memory location. I can guarantee that the value of each integer is less than 2^15. Right now, I'm doing the following: int temp; .... movdq2q mm0, xmm0 psrldq xmm0, 8 movdq2q mm1, xmm0 paddq mm0,mm1 movd temp, mm0...
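One way to do the same addition without moving through the MMX registers is to stay in the XMM register. The sketch below uses intrinsics, and the name `add_halves` is hypothetical. It relies on the stated guarantee that each value is below 2^15, so the sum fits in the low 32 bits.

```cpp
#include <emmintrin.h>  // SSE2: _mm_srli_si128, _mm_add_epi64, _mm_cvtsi128_si32

// Hypothetical helper: adds the upper and lower 64-bit lanes of v and returns
// the low 32 bits of the sum (enough when each lane is below 2^15).
inline int add_halves(__m128i v) {
    const __m128i hi = _mm_srli_si128(v, 8);    // shift upper quadword down into the lower lane
    const __m128i sum = _mm_add_epi64(v, hi);   // lower lane now holds low + high
    return _mm_cvtsi128_si32(sum);              // extract the low 32 bits
}
```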

SSE2 - "The system cannot execute the specified program"

I recently developed a Visual C++ console application which uses inline SSE2 instructions. It works fine on my computer, but when I tried it on another, it returns the following error: The system cannot execute the specified program Note that the program worked on the other computer before introducing the SSE2 code. Any suggestions? ...

Check if Computer supports SSE2 in C++

How do I check whether a computer supports SSE2 in C++? I need to do that prior to installing software that requires it. Any ideas? Thank you. Edit: from what I understand, I came up with this: bool TestSSE2(char * szErrorMsg) { __try { __asm { xorpd xmm0, xmm0 // executing SSE2 ...
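An alternative to executing an SSE2 instruction and catching the fault is to query CPUID directly. On x86, leaf 1 reports SSE2 in bit 26 of EDX. The sketch below is not the asker's code, and the name `cpu_has_sse2` is hypothetical. It uses `__cpuid` from `<intrin.h>` on MSVC and `__get_cpuid` from `<cpuid.h>` on GCC and Clang.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC / Clang)
#endif

// Hypothetical helper: returns true when CPUID leaf 1 reports SSE2 (EDX bit 26).
inline bool cpu_has_sse2() {
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};          // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[3] & (1 << 26)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;                    // leaf 1 not supported
    }
    return (edx & (1u << 26)) != 0;
#endif
}
```

On x86-64 this check always succeeds, because SSE2 is part of the baseline instruction set there.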

implement SIMD in C++

I'm working on a bit of code and I'm trying to optimize it as much as possible, basically get it running under a certain time limit. The following makes the call... static affinity_partitioner ap; parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap); ... and the following is what is executed. void operator()(const blocked...

What's the difference between logical SSE intrinsics?

Hello, Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions: Is there any difference between using one or another intrinsic (wi...

numpy calling sse2 via ctypes

Hello, In brief, I am trying to call into a shared library from Python, more specifically, from numpy. The shared library is implemented in C using SSE2 instructions. Enabling optimisation, i.e. building the library with -O2 or -O1, I am facing strange segfaults when calling into the shared library via ctypes. Disabling optimisation (-O...

Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?

I was reading today about researchers discovering that NVidia's Phys-X libraries use x87 FP vs. SSE2. Obviously this will be suboptimal for parallel datasets where speed trumps precision. However, the article author goes on to quote: Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecate...

SSE2 instruction support with /CLR switch.

Why isn't the SSE2 enhanced instruction set optimization available for C++ programs compiled with the /clr switch? ...

What can be used to replace _mm_set_epi64x on 32-bit Windows?

I'm trying to compile some code that uses the intrinsic _mm_set_epi64x under Visual C++. This intrinsic is supported by VC but only when compiling for x86-64, not for x86-32. I assume this is not an actual limitation of the processor, because other compilers (GCC and Clang) support this intrinsic for both 32 and 64 bit compiles. My firs...
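One possible substitute, sketched under the assumption that `_mm_set_epi32` is available on the 32-bit target, is to split each 64-bit value into two 32-bit halves. The name `set_epi64x_compat` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_set_epi32
#include <cstdint>

// Hypothetical stand-in for _mm_set_epi64x(hi, lo), built from 32-bit pieces.
// _mm_set_epi32 takes its arguments from the highest lane down to the lowest.
inline __m128i set_epi64x_compat(std::int64_t hi, std::int64_t lo) {
    const std::uint64_t h = static_cast<std::uint64_t>(hi);
    const std::uint64_t l = static_cast<std::uint64_t>(lo);
    return _mm_set_epi32(static_cast<int>(static_cast<std::uint32_t>(h >> 32)),
                         static_cast<int>(static_cast<std::uint32_t>(h)),
                         static_cast<int>(static_cast<std::uint32_t>(l >> 32)),
                         static_cast<int>(static_cast<std::uint32_t>(l)));
}
```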

How To Store Values In Non-Contiguous Memory Locations With SSE Intrinsics?

I'm very new to SSE and have optimized a section of code using intrinsics. I'm pleased with the operation itself, but I'm looking for a better way to write the result. The results end up in three __m128i variables. What I'm trying to do is store specific bytes from the result values to non-contiguous memory locations. I'm currently doin...

Finding a median using SSE2 instruction set

Hello, My input data is 16-bit data, and I need to find the median of 3 values using the SSE2 instruction set. If I have three 16-bit input values A, B and C, I thought to do it like this: D = max( max( A, B ), C ) E = min( min( A, B ), C ) median = A + B + C - D - E C functions I am planning to use are : max - _mm_max_epi16 min - _mm_min_e...
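A different formulation from the sum-based one above computes the median with min and max only: median = max(min(A, B), min(max(A, B), C)). The sketch below assumes signed 16-bit lanes, since `_mm_max_epi16` and `_mm_min_epi16` are signed comparisons. The name `median3_epi16` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_min_epi16, _mm_max_epi16

// Hypothetical helper: per-lane median of three vectors of signed 16-bit values.
inline __m128i median3_epi16(__m128i a, __m128i b, __m128i c) {
    const __m128i lo = _mm_min_epi16(a, b);          // smaller of a and b
    const __m128i hi = _mm_max_epi16(a, b);          // larger of a and b
    return _mm_max_epi16(lo, _mm_min_epi16(hi, c));  // clamp c into [lo, hi]
}
```

This takes four min/max operations per eight medians and needs no additions or subtractions.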

How to optimize a cycle?

I have the following bottleneck function. typedef unsigned char byte; void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3) { const byte b1 = 128-30; const byte b2 = 128+30; for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) { *p3 = (*p1 < *p2 ) ? b1 : b2; } } ...
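A possible SSE2 version of this loop processes 16 bytes per iteration. SSE2 has no unsigned byte comparison, so this sketch tests `a >= b` as `max_epu8(a, b) == a` and then selects between the two constants with bitwise masks. The name `CompareArraysSSE2` is hypothetical, and unaligned loads are used because the question does not state the alignment of the buffers.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics

typedef unsigned char byte;

// Hypothetical SSE2 variant of CompareArrays: 16 bytes per iteration,
// with a scalar loop for the remaining tail.
void CompareArraysSSE2(const byte* p1Start, const byte* p1End,
                       const byte* p2, byte* p3) {
    const byte b1 = 128 - 30;
    const byte b2 = 128 + 30;
    const __m128i vb1 = _mm_set1_epi8(static_cast<char>(b1));
    const __m128i vb2 = _mm_set1_epi8(static_cast<char>(b2));
    const byte* p1 = p1Start;
    for (; p1End - p1 >= 16; p1 += 16, p2 += 16, p3 += 16) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p1));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p2));
        // 0xFF in lanes where a >= b (unsigned), 0x00 where a < b
        const __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
        // pick b2 where a >= b, b1 where a < b
        const __m128i r = _mm_or_si128(_mm_and_si128(ge, vb2),
                                       _mm_andnot_si128(ge, vb1));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p3), r);
    }
    for (; p1 != p1End; ++p1, ++p2, ++p3) {  // scalar tail, same as the original
        *p3 = (*p1 < *p2) ? b1 : b2;
    }
}
```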

Converting unsigned chars to float in assembly (to prepare for float vector calculations)

I am trying to optimize a function using SSE2. I'm wondering if I can prepare the data for my assembly code better than this way. My source data is a bunch of unsigned chars from pSrcData. I copy it to this array of floats, as my calculation needs to happen in float. unsigned char *pSrcData = GetSourceDataPointer(); __declspec(alig...
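The widening from unsigned char to float can itself be done in SSE2 instead of a scalar copy loop. The sketch below zero-extends bytes to 16 bits, then to 32 bits, and converts with `_mm_cvtepi32_ps`. The name `bytes16_to_floats` is hypothetical, and the code uses unaligned loads and stores because the alignment of the asker's buffers is not fully shown.

```cpp
#include <emmintrin.h>  // SSE2: unpack and conversion intrinsics

// Hypothetical helper: converts 16 unsigned bytes at src into 16 floats at dst.
inline void bytes16_to_floats(const unsigned char* src, float* dst) {
    const __m128i zero = _mm_setzero_si128();
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    // Interleaving with zero zero-extends: 8-bit -> 16-bit lanes
    const __m128i lo16 = _mm_unpacklo_epi8(v, zero);  // bytes 0..7
    const __m128i hi16 = _mm_unpackhi_epi8(v, zero);  // bytes 8..15
    // 16-bit -> 32-bit lanes, then integer -> float
    _mm_storeu_ps(dst + 0,  _mm_cvtepi32_ps(_mm_unpacklo_epi16(lo16, zero)));
    _mm_storeu_ps(dst + 4,  _mm_cvtepi32_ps(_mm_unpackhi_epi16(lo16, zero)));
    _mm_storeu_ps(dst + 8,  _mm_cvtepi32_ps(_mm_unpacklo_epi16(hi16, zero)));
    _mm_storeu_ps(dst + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(hi16, zero)));
}
```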

boost::shared_array and aligned memory allocation

In Visual C++, I'm trying to dynamically allocate some memory which is 16-byte aligned so I can use SSE2 functions that require memory alignment. Right now this is how I allocate the memory: boost::shared_array<unsigned char> aData(new unsigned char[GetSomeSizeToAllocate()]); I know I can use _aligned_malloc to allocate aligned memory, but will th...
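The usual concern with combining a smart pointer and an aligned allocator is that the memory must be released by the matching free function, which a custom deleter can handle. The sketch below swaps in `std::shared_ptr` with `_mm_malloc` / `_mm_free` rather than the `boost::shared_array` and `_aligned_malloc` named in the question. The names `make_aligned_bytes` and `is_aligned16` are hypothetical.

```cpp
#include <xmmintrin.h>  // _mm_malloc, _mm_free
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical helper: allocates `size` bytes aligned to 16 and returns a
// shared_ptr whose deleter calls _mm_free (the counterpart of _mm_malloc).
inline std::shared_ptr<unsigned char> make_aligned_bytes(std::size_t size) {
    void* raw = _mm_malloc(size, 16);
    return std::shared_ptr<unsigned char>(
        static_cast<unsigned char*>(raw),
        [](unsigned char* p) { _mm_free(p); });
}

// Hypothetical helper: true when p is a multiple of 16.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

The returned pointer is null if the allocation fails, so callers should check it before use.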