I've added an x64 configuration to my C++ project to compile a 64-bit version of my app. Everything looks fine, but the compiler gives the following warning:
`cl : Command line warning D9002 : ignoring unknown option '/arch:SSE2'`
Is SSE2 optimization really not available for 64-bit projects?
...
I currently have the following code:
float a[4] = { 10, 20, 30, 40 };
float b[4] = { 0.1, 0.1, 0.1, 0.1 };
asm volatile("movups (%0), %%xmm0\n\t"
"mulps (%1), %%xmm0\n\t"
"movups %%xmm0, (%1)"
:: "r" (a), "r" (b));
I have first of all a few questions:
(1) if i WERE to a...
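For comparison, the same multiply can be sketched with compiler intrinsics instead of inline assembly. This is a minimal sketch, not the asker's code: the helper name `mul4_into_b` is hypothetical, and it assumes, as the asm appears to, that the product is written back into `b`.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps

// Hypothetical helper (name is an assumption): multiplies the four floats
// in a by the four floats in b and stores the products back into b.
inline void mul4_into_b(const float* a, float* b) {
    const __m128 va = _mm_loadu_ps(a);     // unaligned load, like movups
    const __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(b, _mm_mul_ps(va, vb));  // mulps, then unaligned store
}
```

With intrinsics the compiler handles register allocation, so no clobber list is needed.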
Hello,
I'm trying to come up with a way to make the computer do some work for me. I'm using SIMD (SSE2 & SSE3) to calculate the cross product, and I was wondering if it could go any faster. Currently I have the following:
const int maskShuffleCross1 = _MM_SHUFFLE(3,0,2,1); // y z x
const int maskShuffleCross2 = _MM_SHUFFLE(3,1,0,2); //...
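One commonly used formulation needs only the "y z x" shuffle, applied three times. The sketch below is an illustration under assumptions, not the code from the truncated question: the function name `cross3` is hypothetical and the lanes are assumed to be laid out as (x, y, z, w).

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _mm_mul_ps, _mm_sub_ps

// Hypothetical helper (name is an assumption). Lanes are (x, y, z, w).
// The w lane of the result is aw*bw - aw*bw, i.e. 0 for finite inputs.
inline __m128 cross3(__m128 a, __m128 b) {
    const __m128 a_yzx = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3, 0, 2, 1));
    const __m128 b_yzx = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3, 0, 2, 1));
    // c = (ax*by - ay*bx, ay*bz - az*by, az*bx - ax*bz) = (cross.z, cross.x, cross.y)
    const __m128 c = _mm_sub_ps(_mm_mul_ps(a, b_yzx), _mm_mul_ps(a_yzx, b));
    // Rotate the lanes back into (x, y, z) order.
    return _mm_shuffle_ps(c, c, _MM_SHUFFLE(3, 0, 2, 1));
}
```

Whether this is faster than a four-shuffle version depends on the target CPU, so it is worth measuring rather than assuming.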
I am trying to get SSE functionality in my vector class (I've rewritten it three times so far. :\) and I'm doing the following:
#ifndef _POINT_FINAL_H_
#define _POINT_FINAL_H_
#include "math.h"
namespace Vector3D
{
#define SSE_VERSION 3
#if SSE_VERSION >= 2
#include <emmintrin.h> // SSE2
#if SSE_VERSION >= 3
#inc...
I'm trying to break into SSE2 and tried the following example program:
#include "stdafx.h"
#include <emmintrin.h>
int main(int argc, char* argv[])
{
__declspec(align(16)) long mul; // multiply variable
__declspec(align(16)) int t1[100000]; // temporary variable
__declspec(align(16)) int t2[100000]; // temporary variable
__m128i mul...
EDIT:
This is a follow-up to "SSE2 Compiler Error".
This is the real bug I experienced before and have reproduced below by changing the _mm_malloc statement as Michael Burr suggested:
Unhandled exception at 0x00415116 in SO.exe: 0xC0000005: Access violation reading
location 0xffffffff.
At line label: movdqa xmm0, xmmword ptr [t1+...
I have two packed quadword integers in xmm0 and I need to add them together and store the result in a memory location. I can guarantee that the value of each integer is less than 2^15. Right now, I'm doing the following:
int temp;
....
movdq2q mm0, xmm0
psrldq xmm0, 8
movdq2q mm1, xmm0
paddq mm0,mm1
movd temp, mm0...
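The same sum can be sketched without leaving the XMM registers, which avoids the MMX round trip. This is a minimal sketch using intrinsics rather than the asker's assembly; the function name `sum_two_quadwords` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_srli_si128, _mm_add_epi64, _mm_cvtsi128_si32

// Hypothetical helper (name is an assumption): adds the two 64-bit lanes
// of v and returns the low 32 bits of the sum. Because each input is
// stated to be below 2^15, the sum fits comfortably in an int.
inline int sum_two_quadwords(__m128i v) {
    const __m128i hi = _mm_srli_si128(v, 8);   // shift upper qword into the lower lane
    const __m128i sum = _mm_add_epi64(v, hi);  // lower lane now holds lo + hi
    return _mm_cvtsi128_si32(sum);             // extract the low 32 bits (movd)
}
```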
I recently developed a Visual C++ console application which uses inline SSE2 instructions. It works fine on my computer, but when I tried it on another, it returns the following error:
The system cannot execute the specified program
Note that the program worked on the other computer before introducing the SSE2 code.
Any suggestions?
...
How do I check whether a computer supports SSE2 in C++? I need to do that prior to installing software that requires it. Any ideas? Thank you.
Edit
From what I understand, I came up with this:
bool TestSSE2(char * szErrorMsg)
{
__try
{
__asm
{
xorpd xmm0, xmm0 // executing SSE2 ...
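An alternative to trapping an illegal-instruction exception is to query CPUID directly: SSE2 support is reported in bit 26 of EDX for CPUID leaf 1. The sketch below is one possible way to do that, assuming MSVC's `__cpuid` from `<intrin.h>` or GCC/Clang's `__get_cpuid` from `<cpuid.h>`; the function name `cpu_has_sse2` is hypothetical.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// Hypothetical helper (name is an assumption): returns true when
// CPUID leaf 1 reports SSE2 (EDX bit 26).
inline bool cpu_has_sse2() {
#if defined(_MSC_VER)
    int regs[4] = { 0, 0, 0, 0 };  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[3] & (1 << 26)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;              // leaf 1 not available
    }
    return (edx & (1u << 26)) != 0;
#endif
}
```

The check itself must live in code compiled without SSE2 code generation, otherwise the program may fault before the test runs.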
I'm working on a bit of code and I'm trying to optimize it as much as possible, basically get it running under a certain time limit.
The following makes the call...
static affinity_partitioner ap;
parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);
... and the following is what is executed.
void operator()(const blocked...
Hello,
Is there any difference between the logical SSE intrinsics for different types? For example, if we take the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions:
Is there any difference between using one or another intrinsic (wi...
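The claim that the three intrinsics compute the same bits can be checked with a short sketch. It only compares result bits and says nothing about performance or execution-domain effects; the function name `or_variants_agree` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_or_si128, _mm_or_pd, cast intrinsics
#include <cstring>      // std::memcmp

// Hypothetical helper (name is an assumption): ORs the same 128 bits with
// all three intrinsics and reports whether the results are bit-identical.
inline bool or_variants_agree(__m128i a, __m128i b) {
    const __m128i r_int = _mm_or_si128(a, b);
    const __m128i r_ps = _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
    const __m128i r_pd = _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(a), _mm_castsi128_pd(b)));
    return std::memcmp(&r_int, &r_ps, sizeof r_int) == 0 &&
           std::memcmp(&r_int, &r_pd, sizeof r_int) == 0;
}
```

The cast intrinsics only reinterpret the register contents; they do not convert values.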
Hello,
In brief, I am trying to call into a shared library from Python, more specifically, from NumPy. The shared library is implemented in C using SSE2 instructions. Enabling optimisation, i.e. building the library with -O2 or -O1, I am facing strange segfaults when calling into the shared library via ctypes. Disabling optimisation (-O...
I was reading today about researchers discovering that NVIDIA's PhysX libraries use x87 FP instead of SSE2. Obviously this will be suboptimal for parallel datasets where speed trumps precision. However, the article author goes on to quote:
Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecate...
Why isn't the SSE2 enhanced instruction set optimization available for C++ programs compiled with the /clr switch?
...
I'm trying to compile some code that uses the intrinsic _mm_set_epi64x under Visual C++. This intrinsic is supported by VC but only when compiling for x86-64, not for x86-32. I assume this is not an actual limitation of the processor, because other compilers (GCC and Clang) support this intrinsic for both 32 and 64 bit compiles.
My firs...
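One possible workaround is to build the same vector from four 32-bit halves with `_mm_set_epi32`. This is a sketch, not necessarily what the truncated question goes on to propose; the function name `set_epi64x_compat` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_set_epi32
#include <cstdint>

// Hypothetical helper (name is an assumption): builds the vector that
// _mm_set_epi64x(hi, lo) would produce, from four 32-bit pieces.
// The unsigned-to-int casts rely on the usual two's-complement wraparound.
inline __m128i set_epi64x_compat(std::int64_t hi, std::int64_t lo) {
    const std::uint64_t h = static_cast<std::uint64_t>(hi);
    const std::uint64_t l = static_cast<std::uint64_t>(lo);
    return _mm_set_epi32(static_cast<int>(static_cast<std::uint32_t>(h >> 32)),
                         static_cast<int>(static_cast<std::uint32_t>(h)),
                         static_cast<int>(static_cast<std::uint32_t>(l >> 32)),
                         static_cast<int>(static_cast<std::uint32_t>(l)));
}
```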
I'm very new to SSE and have optimized a section of code using intrinsics. I'm pleased with the operation itself, but I'm looking for a better way to write the result. The results end up in three __m128i variables.
What I'm trying to do is store specific bytes from the result values to non-contiguous memory locations. I'm currently doin...
Hello,
My input data is 16-bit data, and I need to find the median of 3 values using the SSE2 instruction set.
If I have three 16-bit input values A, B and C, I thought to do it like this:
D = max( max( A, B ), C )
E = min( min( A, B ), C )
median = A + B + C - D - E
The C functions I am planning to use are:
max - _mm_max_epi16
min - _mm_min_e...
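An alternative that avoids the additions and subtractions is median = max(min(A, B), min(max(A, B), C)). The sketch below assumes the data is signed 16-bit, since `_mm_max_epi16` and `_mm_min_epi16` compare signed values; the function name `median3_epi16` is hypothetical.

```cpp
#include <emmintrin.h>  // SSE2: _mm_min_epi16, _mm_max_epi16

// Hypothetical helper (name is an assumption): per-lane median of three
// vectors of eight signed 16-bit values, using two mins and two maxes.
inline __m128i median3_epi16(__m128i a, __m128i b, __m128i c) {
    const __m128i lo = _mm_min_epi16(a, b);          // smaller of a, b
    const __m128i hi = _mm_max_epi16(a, b);          // larger of a, b
    return _mm_max_epi16(lo, _mm_min_epi16(hi, c));  // clamp c into [lo, hi]
}
```

If the input is unsigned 16-bit, the signed comparisons give wrong results for values of 32768 and above unless the data is biased first.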
I have the following bottleneck function.
typedef unsigned char byte;
void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3)
{
const byte b1 = 128-30;
const byte b2 = 128+30;
for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) {
*p3 = (*p1 < *p2 ) ? b1 : b2;
}
}
...
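A possible SSE2 version of this loop processes 16 bytes per iteration. SSE2 has no unsigned byte compare, so this sketch uses the identity "a >= b exactly when max(a, b) == a" with `_mm_max_epu8`. The function name `CompareArraysSSE2` is hypothetical, and unaligned loads and stores are used because the original does not state any alignment.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics

typedef unsigned char byte;

// Hypothetical SSE2 variant (name is an assumption) of CompareArrays:
// writes b1 where *p1 < *p2 and b2 otherwise.
void CompareArraysSSE2(const byte* p1Start, const byte* p1End,
                       const byte* p2, byte* p3) {
    const byte b1 = 128 - 30;
    const byte b2 = 128 + 30;
    const __m128i v1 = _mm_set1_epi8(static_cast<char>(b1));
    const __m128i v2 = _mm_set1_epi8(static_cast<char>(b2));
    const byte* p1 = p1Start;
    // Vector part: 16 bytes at a time.
    for (; p1End - p1 >= 16; p1 += 16, p2 += 16, p3 += 16) {
        const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p1));
        const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p2));
        // ge lane is 0xFF where a >= b (unsigned), 0x00 where a < b.
        const __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(a, b), a);
        // Select b2 where ge is set, b1 elsewhere.
        const __m128i r = _mm_or_si128(_mm_and_si128(ge, v2),
                                       _mm_andnot_si128(ge, v1));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p3), r);
    }
    // Scalar tail for the remaining 0..15 bytes.
    for (; p1 != p1End; ++p1, ++p2, ++p3) {
        *p3 = (*p1 < *p2) ? b1 : b2;
    }
}
```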
I am trying to optimize a function using SSE2. I'm wondering if I can prepare the data for my assembly code better than this way. My source data is a bunch of unsigned chars from pSrcData. I copy it to this array of floats, as my calculation needs to happen in float.
unsigned char *pSrcData = GetSourceDataPointer();
__declspec(alig...
In Visual C++, I'm trying to dynamically allocate some memory which is 16-byte aligned so I can use SSE2 functions that require memory alignment. Right now this is how I allocate the memory:
boost::shared_array<unsigned char> aData(new unsigned char[GetSomeSizeToAllocate()]);
I know I can use _aligned_malloc to allocate aligned memory, but will th...
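One way to combine aligned allocation with shared ownership is a smart pointer with a custom deleter. The sketch below swaps in `std::shared_ptr` and `_mm_malloc`/`_mm_free` rather than `boost::shared_array` and `_aligned_malloc`, so treat it as an illustration of the idea only; the function name `make_aligned_bytes` is hypothetical.

```cpp
#include <emmintrin.h>  // pulls in _mm_malloc / _mm_free on common compilers
#include <cstddef>
#include <memory>

// Hypothetical helper (name is an assumption): allocates size bytes aligned
// to 16 bytes and releases them with _mm_free when the last owner goes away.
inline std::shared_ptr<unsigned char> make_aligned_bytes(std::size_t size) {
    void* raw = _mm_malloc(size, 16);
    return std::shared_ptr<unsigned char>(
        static_cast<unsigned char*>(raw),
        [](unsigned char* p) { _mm_free(p); });  // must match the allocator used
}
```

The key point is that memory from an aligned allocator must be released by its matching free function, never by `delete[]`, which is why the deleter is supplied explicitly.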