When I compile this with ml64.exe (64-bit MASM),
the SSE instructions give me an error.
What do I need to do to use the SSE instructions in 64-bit code?
.code
foo PROC
movlps [rdx], xmm7 ;;error A2070: invalid instruction operands
movhlps xmm6, xmm7
movss [rdx+8], xmm6 ;;error A2070: invalid instruction operands
ret
foo ENDP
end
I get t...
(I'm a newbie to SSE/asm, apologies if this is obvious or redundant)
Is there a better way to transpose 8 SSE registers containing 16-bit values than performing 24 unpck[lh]ps and 8/16+ shuffles and using 8 extra registers? (Note: I can only use instructions up to SSSE3; the target is Intel Merom, which lacks the BLEND* instructions from SSE4.)
Say you have registers v[0-7] ...
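For reference, a transpose of this shape is often written with the integer unpack instructions at 16-, 32- and 64-bit granularity. The sketch below is only an illustration under assumptions: the function name transpose8x8_i16 and the row-major array interface are hypothetical, the rows are loaded from memory rather than already sitting in v[0-7], and only SSE2 is used. It performs 24 unpacks and no separate shuffles; it makes no claim about being faster than the approach described in the question.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stdint.h>

/* Hypothetical helper: transposes an 8x8 matrix of 16-bit values.
   `in` and `out` are both row-major arrays of 64 elements. */
void transpose8x8_i16(const int16_t in[64], int16_t out[64])
{
    __m128i v[8], a[8], b[8];
    int i;

    /* One register per row (unaligned loads, so no alignment is assumed). */
    for (i = 0; i < 8; i++)
        v[i] = _mm_loadu_si128((const __m128i *)(in + 8 * i));

    /* Stage 1: interleave the 16-bit lanes of adjacent row pairs. */
    for (i = 0; i < 4; i++) {
        a[2 * i]     = _mm_unpacklo_epi16(v[2 * i], v[2 * i + 1]);
        a[2 * i + 1] = _mm_unpackhi_epi16(v[2 * i], v[2 * i + 1]);
    }

    /* Stage 2: interleave 32-bit lanes, giving four rows per column chunk. */
    b[0] = _mm_unpacklo_epi32(a[0], a[2]);
    b[1] = _mm_unpackhi_epi32(a[0], a[2]);
    b[2] = _mm_unpacklo_epi32(a[4], a[6]);
    b[3] = _mm_unpackhi_epi32(a[4], a[6]);
    b[4] = _mm_unpacklo_epi32(a[1], a[3]);
    b[5] = _mm_unpackhi_epi32(a[1], a[3]);
    b[6] = _mm_unpacklo_epi32(a[5], a[7]);
    b[7] = _mm_unpackhi_epi32(a[5], a[7]);

    /* Stage 3: interleave 64-bit halves; each result is one full column. */
    v[0] = _mm_unpacklo_epi64(b[0], b[2]);
    v[1] = _mm_unpackhi_epi64(b[0], b[2]);
    v[2] = _mm_unpacklo_epi64(b[1], b[3]);
    v[3] = _mm_unpackhi_epi64(b[1], b[3]);
    v[4] = _mm_unpacklo_epi64(b[4], b[6]);
    v[5] = _mm_unpackhi_epi64(b[4], b[6]);
    v[6] = _mm_unpacklo_epi64(b[5], b[7]);
    v[7] = _mm_unpackhi_epi64(b[5], b[7]);

    for (i = 0; i < 8; i++)
        _mm_storeu_si128((__m128i *)(out + 8 * i), v[i]);
}
```

The three temporary arrays are there for readability; a compiler is free to keep everything in registers, and on 32-bit targets with only 8 XMM registers some spilling is likely.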
I've done some inline ASM coding for SSE before and it was not too hard even for someone who doesn't know ASM. But I note MS also provides intrinsics wrapping many such special instructions.
Is there a particular performance difference, or any other strong reason why one should be used over the other?
To repeat from the title, this is ...
Hi,
My professor came across this interesting experiment on 3D linearly separable kernel convolution using SSE and OpenMP, and gave me the task of benchmarking it on our system. The author claims a crazy 18-fold speedup over the serial approach! That might not always hold, but we were expecting at least a 2-4 times speedup running th...
Case One
Say you have a little class:
class Point3D
{
private:
    float x,y,z;
public:
    Point3D &operator+=(const Point3D &other);
    ...etc
};
Point3D &Point3D::operator+=(const Point3D &other)
{
    this->x += other.x;
    this->y += other.y;
    this->z += other.z;
    return *this;
}
A naive use of SSE would simply replace these function bodies with a few intrinsics. But would we e...
Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that?
Update
How about this? Only 16 ...
I am trying to build an application that uses pthreads and the __m128 SSE type. According to the GCC manual, the default stack alignment is 16 bytes, and __m128 requires 16-byte alignment.
My target CPU supports SSE. I use a GCC compiler which doesn't support runtime stack realignment (e.g. -mstackrealign). I cannot use any ot...
I tried to follow:
Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set
But the only options I got were - SSE or SSE2.
Thanks.
...
Hello,
Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions:
Is there any difference between using one or another intrinsic (wi...
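The claim that the three intrinsics compute the same bits can be checked directly. The sketch below is a minimal illustration, assuming an x86 target with SSE2; the helper name or_variants_agree is hypothetical. It only compares bit patterns and says nothing about performance or about which execution domain a given CPU prefers.

```c
#include <emmintrin.h>  /* SSE2: __m128, __m128d, __m128i and the cast intrinsics */
#include <string.h>

/* Hypothetical helper: ORs two 16-byte blocks with _mm_or_si128, _mm_or_ps
   and _mm_or_pd, and returns 1 if all three results are bit-identical. */
int or_variants_agree(const void *a, const void *b)
{
    __m128i ai = _mm_loadu_si128((const __m128i *)a);
    __m128i bi = _mm_loadu_si128((const __m128i *)b);

    /* The cast intrinsics only reinterpret the register; no conversion happens. */
    __m128i r_int = _mm_or_si128(ai, bi);
    __m128i r_ps  = _mm_castps_si128(
                        _mm_or_ps(_mm_castsi128_ps(ai), _mm_castsi128_ps(bi)));
    __m128i r_pd  = _mm_castpd_si128(
                        _mm_or_pd(_mm_castsi128_pd(ai), _mm_castsi128_pd(bi)));

    unsigned char o_int[16], o_ps[16], o_pd[16];
    _mm_storeu_si128((__m128i *)o_int, r_int);
    _mm_storeu_si128((__m128i *)o_ps,  r_ps);
    _mm_storeu_si128((__m128i *)o_pd,  r_pd);

    return memcmp(o_int, o_ps, 16) == 0 && memcmp(o_int, o_pd, 16) == 0;
}
```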
I'm writing a highly parallel application that's multithreaded. I've already got an SSE accelerated thread class written. If I were to write an MMX accelerated thread class, then run both at the same time (one SSE thread and one MMX thread per core) would the performance improve noticeably?
I would think that this setup would help hide...
Hi, I am curious: do new compilers use the extra features built into new CPUs, such as MMX, SSE, 3DNow! and so on?
I mean, the original 8086 did not even have an FPU, so a compiler that old could not use one, but new compilers can, since an FPU is part of every new CPU. So, do new compilers use the new features of the CPU?
Or, it should be more right...
Hi all,
This is the first time I am posting a question on stackoverflow, so please try and overlook any errors I may have made in formatting my question/code. But please do point the same out to me so I may be more careful.
I was trying to write some simple intrinsics routines for the addition of two 128-bit (containing 4 float variabl...
I have some code in a loop
for(int i = 0; i < n; i++)
{
u[i] = c * u[i] + s * b[i];
}
So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?
UPDATE
I learnt vectorization (turns out it's not so hard if you use intrinsics) and ...
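The loop above maps fairly directly onto SSE intrinsics. The following is a minimal sketch, not the poster's code: the function name scale_add_sse is hypothetical, unaligned loads and stores are used so that no alignment of u or b is assumed, and u and b are assumed not to overlap.

```c
#include <xmmintrin.h>  /* SSE single-precision intrinsics */
#include <stddef.h>

/* Hypothetical helper: u[i] = c * u[i] + s * b[i] for i in [0, n). */
void scale_add_sse(float *u, const float *b, float c, float s, size_t n)
{
    const __m128 vc = _mm_set1_ps(c);  /* c broadcast to all four lanes */
    const __m128 vs = _mm_set1_ps(s);  /* s broadcast to all four lanes */
    size_t i = 0;

    /* Main loop: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vu = _mm_loadu_ps(u + i);
        __m128 vb = _mm_loadu_ps(b + i);
        vu = _mm_add_ps(_mm_mul_ps(vc, vu), _mm_mul_ps(vs, vb));
        _mm_storeu_ps(u + i, vu);
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        u[i] = c * u[i] + s * b[i];
}
```

A loop this simple does little arithmetic per byte loaded, so for large n the speedup may be limited by memory bandwidth rather than by the arithmetic.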
Hello, I'm working on converting a bit of code to SSE, and while I have the correct output, it turns out to be slower than standard C++ code.
The bit of code that I need to do this for is:
float ox = p2x - (px * c - py * s)*m;
float oy = p2y - (px * s - py * c)*m;
What I've got for SSE code is:
void assemblycalc(vector4 &p, vector4 &...
Hi,
I wrote a simple program that uses SSE intrinsics to compute the inner product of two large (100000 or more elements) vectors. The program compares the execution time of the inner product computed the conventional way against the version using intrinsics. Everything works out fine, until I insert (just for the fun of it) an inner loop befo...
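For context, an SSE inner product of the kind described typically looks something like the sketch below. This is an assumption about the general shape, not the poster's program: the name dot_sse is hypothetical and unaligned loads are used. Because four partial sums are accumulated in parallel, the rounding can differ slightly from a plain left-to-right scalar loop.

```c
#include <xmmintrin.h>  /* SSE single-precision intrinsics */
#include <stddef.h>

/* Hypothetical helper: returns the sum of a[i] * b[i] for i in [0, n). */
float dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* four running partial sums */
    float lanes[4];
    float sum;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }

    /* Horizontal reduction of the four partial sums. */
    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```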
I am trying to optimize some arithmetic by using the MMX and SSE instruction sets with inline assembly. However, I have been unable to find good references for the timings and usages of these enhanced instruction sets. Could you please help me find references that contain information about the throughput, latency, operands, and perhaps s...
Quick Summary:
I have an array of 24-bit values. Any suggestion on how to quickly expand the individual 24-bit array elements into 32-bit elements?
Details:
I'm processing incoming video frames in realtime using Pixel Shaders in DirectX 10. A stumbling block is that my frames are coming in from the capture hardware with 24-bit pixel...
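As a point of comparison for any faster variant, a plain scalar expansion could be sketched as below. The function name expand_24_to_32 is hypothetical, and the layout is an assumption: each 24-bit element is taken to be three consecutive bytes, least significant byte first, and the top byte of each 32-bit result is set to zero. The real channel order of the capture hardware is not known from the question.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar baseline: expands `count` packed 24-bit elements
   (3 bytes each, least significant byte first) into 32-bit elements. */
void expand_24_to_32(const uint8_t *src, uint32_t *dst, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        const uint8_t *p = src + 3 * i;
        dst[i] = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16);  /* bits 24..31 stay zero */
    }
}
```

A vectorized version would need a byte shuffle to spread 12 source bytes across 16 destination bytes; this baseline is mainly useful for checking such a version against.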
Is there a fast way to cast double values to shorts (16-bit signed)? Currently I'm doing something like this:
double dval = <sum junk>
int16_t sval;
if (dval > INT16_MAX) {
    sval = INT16_MAX;
} else if (dval < INT16_MIN) {
    sval = INT16_MIN;
} else {
    sval = (int16_t)dval;
}
I suspect there's a fast way to do this using SSE that wi...
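One branch-free way to express the clamp above with SSE2 is to clamp in the double domain and then truncate. This is a sketch under assumptions, not necessarily the approach the poster had in mind: the name saturate_to_i16 is hypothetical, and a NaN input ends up as -32768 here because of how MAXSD treats NaN operands.

```c
#include <emmintrin.h>  /* SSE2 double-precision intrinsics */
#include <stdint.h>

/* Hypothetical helper: clamps d to [-32768, 32767], then truncates toward zero. */
int16_t saturate_to_i16(double d)
{
    __m128d v = _mm_set_sd(d);
    v = _mm_max_sd(v, _mm_set_sd(-32768.0));  /* lower bound (NaN becomes -32768) */
    v = _mm_min_sd(v, _mm_set_sd(32767.0));   /* upper bound */
    /* After clamping the value always fits, so the narrowing cast is safe. */
    return (int16_t)_mm_cvttsd_si32(v);
}
```

For arrays, the same idea extends to _mm_max_pd and _mm_min_pd on two doubles at a time, followed by a packed conversion.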
I have to optimize a piece of code using SSE extensions. My target platforms are Windows and Linux, so I build my application with both the MS compiler (Visual Studio) and GCC.
What approaches exist for using SSE? I can find a lot of examples of how to use SSE with GCC, but they seem to be incompatible with the MS compiler. Does exis...
I'm wondering if there is a way to extend the Ruby Array type to do SIMD & SSE vector calculations.
I mean implementing it in a low-level language so that it can be used in Ruby programs for number-crunching tasks.
...