sse

Why is SSE scalar sqrt(x) slower than rsqrt(x) * x?

I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode! I'm testing it with a loop something ...
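
For context, a minimal sketch of the two code paths being compared, using the scalar SSE intrinsics (the rsqrtss route is only an approximation, so a Newton-Raphson refinement step is usually added when accuracy matters):

    #include <xmmintrin.h>

    /* Exact route: a single sqrtss. */
    static inline float sqrt_native(float x)
    {
        return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
    }

    /* Approximate route: rsqrtss followed by a multiply; only about 12 bits
       of precision unless refined with a Newton-Raphson iteration. */
    static inline float sqrt_via_rsqrt(float x)
    {
        __m128 v = _mm_set_ss(x);
        return _mm_cvtss_f32(_mm_mul_ss(_mm_rsqrt_ss(v), v));
    }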

SSE register return with SSE disabled

I am in the following situation: I am writing code for a kernel that does not allow SSE instructions; I need to do floating-point arithmetic; and I'm compiling for an x86_64 platform. Here is a code sample that illustrates the problem: int main(int argc, char** argv) { double d = 0.0, dbase; uint64_t base_value = 300; d = (2200...
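
For reference, a minimal reproduction of the error in the title, assuming GCC on x86-64: the SysV ABI returns doubles in %xmm0, which is why floating-point code and -mno-sse clash.

    /* gcc -mno-sse -c t.c
       error: SSE register return with SSE disabled
       The x86-64 SysV ABI passes and returns doubles in XMM registers,
       so floating-point code cannot be compiled with SSE turned off
       unless it is reworked (e.g. into fixed-point integer math). */
    double half(double x) { return x * 0.5; }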

Java performance in numerical algorithms

Hello again. I am curious about the performance of Java numerical algorithms, say for example matrix-matrix double-precision multiplication, using the latest JIT machines as compared to, for example, hand-tuned SSE C++/assembler or Fortran counterparts. I have looked on the web, but most of the results come from almost 10 years ago and I under...

SSE2: How to reduce a __m128i to a word

Hello. What's the best way (SSE2) to reduce a __m128i (4 words: a, b, c, d) to one word? I want the low byte of each __m128i component: int result = ( _m128.a & 0x000000ff ) << 24 | ( _m128.b & 0x000000ff ) << 16 | ( _m128.c & 0x000000ff ) << 8 | ( _m128.d & 0x000000ff ) << 0 Is there an intrinsic for that? Thanks...
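
A sketch of an SSE2-only way to do it, assuming lane 0 should end up in the low byte of the result (shuffle first if the opposite order is wanted); SSSE3's pshufb would do it in one instruction, but with plain SSE2 two packs work:

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* Pack the low byte of each 32-bit lane of v into one 32-bit word. */
    static inline uint32_t pack_low_bytes(__m128i v)
    {
        v = _mm_and_si128(v, _mm_set1_epi32(0xff)); /* keep low byte of each lane */
        v = _mm_packs_epi32(v, v);                  /* 32-bit -> 16-bit (values <= 255, no saturation) */
        v = _mm_packus_epi16(v, v);                 /* 16-bit -> 8-bit */
        return (uint32_t)_mm_cvtsi128_si32(v);      /* low 32 bits hold the four bytes */
    }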

Benchmarking SSE instructions

I'm benchmarking some SSE code (multiplying 4 floats by 4 floats) against traditional C code doing the same thing. I think my benchmark code must be incorrect in some way because it seems to say that the non-SSE code is faster than the SSE code by a factor of 2-3. Can someone tell me what is wrong with the benchmarking code below? And perhap...
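
For reference, the SSE half of such a benchmark usually boils down to a couple of intrinsics; common reasons for a slow SSE result are rebuilding vectors element by element inside the loop, unaligned loads, or the scalar loop being optimized away. A minimal sketch (names are illustrative), assuming 16-byte-aligned arrays of 4 floats:

    #include <xmmintrin.h>

    /* Multiply two arrays of 4 floats and store the result.
       a, b and out are assumed to be 16-byte aligned. */
    static inline void mul4(const float *a, const float *b, float *out)
    {
        _mm_store_ps(out, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
    }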

C - How to access elements of vector using GCC SSE vector extension

Usually I work with 3D vectors using the following type: typedef float vec3_t[3]; initializing vectors with something like: vec3_t x_basis = {1.0, 0.0, 0.0}; vec3_t y_basis = {0.0, 1.0, 0.0}; vec3_t z_basis = {0.0, 0.0, 1.0}; and accessing them with something like: x_basis[X] * y_basis[X] + ... Now I need vector arithmetic using SSE i...
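
A minimal sketch of the GCC vector-extension route; element subscripting on vector types needs a reasonably recent GCC, and with older versions a union with a float[4] is the usual workaround:

    /* 16-byte vector of 4 floats, GCC/Clang vector extension. */
    typedef float v4sf __attribute__((vector_size(16)));

    static inline float dot3(v4sf a, v4sf b)
    {
        v4sf p = a * b;              /* element-wise multiply */
        return p[0] + p[1] + p[2];   /* direct element access */
    }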

Fast Image Manipulation using SSE instructions?

I am writing a graphics library in C and I would like to utilize SSE instructions to speed up some of the functions. How would I go about doing this? I am using the GCC compiler so I can rely on compiler intrinsics. I would also like to know whether I should change the way I am storing the image data (currently I am just using an array o...
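
As a sketch of what this tends to look like: with the pixels stored contiguously as 8-bit channels (and ideally 16-byte aligned), one intrinsic processes 16 channel values at a time. The example below brightens an RGBA8 buffer with saturating adds; the layout and function name are illustrative:

    #include <emmintrin.h>
    #include <stddef.h>

    /* Add 'amount' to every 8-bit channel with unsigned saturation.
       Assumes len is a multiple of 16 and pixels is 16-byte aligned. */
    static void brighten(unsigned char *pixels, size_t len, unsigned char amount)
    {
        __m128i add = _mm_set1_epi8((char)amount);
        for (size_t i = 0; i < len; i += 16) {
            __m128i p = _mm_load_si128((__m128i *)(pixels + i));
            _mm_store_si128((__m128i *)(pixels + i), _mm_adds_epu8(p, add));
        }
    }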

Benefits of x87 over SSE

I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse automatically in any project, and I wonder if I'm missing anything else that the x87 FPU offers. ...

How to determine if memory is aligned? ( *testing* for alignment, not aligning )

Hi there, I am new to optimizing code with SSE/SSE2 instructions, and until now I have not gotten very far. To my knowledge, a common SSE-optimized function would look like this: void sse_func(const float* const ptr, int len){ if( ptr is aligned ) { for( ... ){ // unroll loop by 4 or 2 elements } ...
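
The test itself is just a check on the pointer's low bits; a sketch (16 bytes is the alignment SSE loads/stores need; the helper name is illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Nonzero if ptr is aligned to 'align' bytes (align must be a power of two). */
    static inline int is_aligned(const void *ptr, size_t align)
    {
        return ((uintptr_t)ptr & (align - 1)) == 0;
    }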

Is there any SSE2+ book?

Is there any book that teaches SSE starting with version 2? I couldn't find any and there aren't many tutorials/articles on the net. Thanks in advance! ...

What is the fastest way to test if a double number is an integer (on modern Intel x86 processors)?

Our server application does a lot of integer tests in a hot code path; currently we use the following function: inline int IsInteger(double n) { return n - floor(n) < 1e-8; } This function is very hot in our workload, so I want it to be as fast as possible. I also want to eliminate the "floor" library call if I can. Any suggestions? ...
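
One possibility on SSE4.1-capable CPUs is to round in a register and compare, which removes the library call; note this sketch tests exact integrality, so it drops the 1e-8 tolerance of the original, and it needs -msse4.1:

    #include <smmintrin.h>  /* SSE4.1 */

    static inline int IsIntegerSSE41(double n)
    {
        __m128d v = _mm_set_sd(n);
        /* roundsd with truncation, then compare against the original value */
        __m128d t = _mm_round_sd(v, v, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
        return _mm_comieq_sd(v, t);
    }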

How to work with a 128-bit C variable and 128-bit XMM asm?

In GCC, I want to do a 128-bit XOR of two C variables via asm code: how? asm ( "movdqa %1, %%xmm1;" "movdqa %0, %%xmm0;" "pxor %%xmm1,%%xmm0;" "movdqa %%xmm0, %0;" :"=x"(buff) /* output operand */ :"x"(bu), "x"(buff) :"%xmm0","%xmm1" ); but I get a segmentation fault; this is the objdump out...
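
For comparison, the same 128-bit XOR without hand-written asm: a movdqa segfault is usually an alignment problem, so this sketch uses the unaligned load/store forms and lets the compiler allocate registers (buffer names follow the question):

    #include <emmintrin.h>

    /* buff ^= bu, 16 bytes at a time; loadu/storeu tolerate unaligned buffers. */
    static void xor128(unsigned char *buff, const unsigned char *bu)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)buff);
        __m128i b = _mm_loadu_si128((const __m128i *)bu);
        _mm_storeu_si128((__m128i *)buff, _mm_xor_si128(a, b));
    }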

VC++ SSE intrinsic optimisation weirdness

I am performing a scattered read of 8-bit data from a file (de-interleaving a 64-channel wave file). I am then combining them into a single stream of bytes. The problem I'm having is with my reconstruction of the data to write out. Basically I'm reading in 16 bytes and then building them into a single __m128i variable and then using...
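
For what it's worth, assembling 16 gathered bytes into a __m128i is normally written with _mm_set_epi8 (arguments go highest lane first) and left to the compiler to schedule; a sketch with an illustrative name:

    #include <emmintrin.h>

    static inline __m128i pack16(const unsigned char b[16])
    {
        return _mm_set_epi8(b[15], b[14], b[13], b[12], b[11], b[10], b[9], b[8],
                            b[7],  b[6],  b[5],  b[4],  b[3],  b[2],  b[1], b[0]);
    }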

How do you populate an x86 XMM register with 4 identical floats from another XMM register entry?

I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE. I'd like to copy and duplicate values (from an XMM register, or from memory) to another XMM register. For example, suppose I have some values {1, 2, 3, 4} in memory. I'd like to copy these values such that xmm1 is populated with {1, 1, 1, 1}, xmm2 wit...
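
A sketch of the usual approach: shufps (or its intrinsic form _mm_shuffle_ps) with an immediate that repeats one lane broadcasts that element to all four positions.

    #include <xmmintrin.h>

    /* Broadcast lane 0 / lane 1 of v into every lane. In inline asm the
       equivalent is e.g. "shufps $0x00, %%xmm0, %%xmm0". */
    static inline __m128 splat0(__m128 v) { return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)); }
    static inline __m128 splat1(__m128 v) { return _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); }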

Common SIMD techniques

Hi! Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know SIMD is now much more powerful: it can handle complex conditional code without branches. For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of t...
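
As an aside, the SSE2 counterpart of that ARMv6 byte-wise trick is a single branchless instruction:

    #include <emmintrin.h>

    /* Unsigned minimum of 16 byte pairs at once (pminub). */
    static inline __m128i min_u8(__m128i a, __m128i b)
    {
        return _mm_min_epu8(a, b);
    }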

GCC Extended ASM syntax: load 128-bit memory location as source

Hi all, GCC generates this code for the shuffle() below: movaps xmm0,XMMWORD PTR [rip+0x125] pshufb xmm4,xmm0 Ideally this should be: pshufb xmm4,XMMWORD PTR [rip+0x125] What is the extended ASM syntax to generate this single instruction? Many thanks, Adam PS: The commented out intrinsic generates the optimal code for this examp...
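
A sketch of one way to get the folded form: pass the mask through an "m" constraint so it can be used directly as pshufb's memory operand (the operand must be 16-byte aligned, which a valid __m128i pointer guarantees; names are illustrative):

    #include <emmintrin.h>  /* __m128i */

    static inline __m128i shuffle_mem(__m128i v, const __m128i *mask)
    {
        __asm__("pshufb %1, %0" : "+x"(v) : "m"(*mask));
        return v;
    }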

Storing two x86 32-bit registers into a 128-bit XMM register

Is there any faster method to store two x86 32-bit registers in one 128-bit XMM register? movd xmm0, edx movd xmm1, eax pshufd xmm0, xmm0, $1 por xmm0, xmm1 So if EAX is 0x12345678 and EDX is 0x87654321, the result in xmm0 must be 0x8765432112345678. Thanks ...
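
In intrinsic form, a common alternative is two movd's followed by punpckldq, which drops the por; with SSE4.1, pinsrd does it in two instructions. A sketch (illustrative name):

    #include <emmintrin.h>
    #include <stdint.h>

    /* Result: low dword = lo (EAX), next dword = hi (EDX), upper 64 bits zero. */
    static inline __m128i combine32(uint32_t lo, uint32_t hi)
    {
        return _mm_unpacklo_epi32(_mm_cvtsi32_si128((int)lo),
                                  _mm_cvtsi32_si128((int)hi));
    }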

Is it possible to vectorize myNum += a[b[i]] * c[i]; on x86_64?

What intrinsics would I use to vectorize the following (if it's even possible to vectorize) on x86_64? double myNum = 0; for(int i=0;i<n;i++){ myNum += a[b[i]] * c[i]; //b[i] = int, a[b[i]] = double, c[i] = double } ...
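
Without a gather instruction (SSE has none), the a[b[i]] loads have to stay scalar; what can be vectorized is the multiply and the accumulation. A sketch, assuming n is even (handle the tail separately):

    #include <emmintrin.h>

    static double dot_indexed(const double *a, const int *b, const double *c, int n)
    {
        __m128d sum = _mm_setzero_pd();
        for (int i = 0; i < n; i += 2) {
            __m128d av = _mm_set_pd(a[b[i + 1]], a[b[i]]);  /* scalar gather */
            __m128d cv = _mm_loadu_pd(&c[i]);
            sum = _mm_add_pd(sum, _mm_mul_pd(av, cv));
        }
        /* horizontal add of the two lanes */
        return _mm_cvtsd_f64(_mm_add_pd(sum, _mm_unpackhi_pd(sum, sum)));
    }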

Qt, GCC, SSE and stack alignment

Hi there, I'm trying to build a program with GCC that uses Qt and SSE intrinsics. It seems that when one of my functions is called by Qt, the stack alignment is not preserved. Here's a short example to illustrate what I mean: #include <cstdio> #include <emmintrin.h> #include <QtGui/QApplication.h> #include <QtGui/QWidget.h> ...
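
A sketch of the usual workaround on 32-bit x86: mark the functions that may be entered with a misaligned stack so GCC realigns it on entry (building everything with -mstackrealign is the blunter alternative). The function name is illustrative:

    #include <emmintrin.h>

    /* GCC realigns the stack to 16 bytes on entry, so local __m128d
       variables and spills stay safe even if the caller misaligned it. */
    __attribute__((force_align_arg_pointer))
    void callback_from_qt(void)
    {
        __m128d v = _mm_set1_pd(1.0);  /* needs 16-byte-aligned spill slots */
        (void)v;
    }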

Combining prefixes in SSE

In SSE, the prefixes 066h (operand-size override), 0F2h (REPNE) and 0F3h (REPE) are part of the opcode. In non-SSE code, 066h switches between 32-bit (or 64-bit) and 16-bit operation, while 0F2h and 0F3h are used for string operations. They can be combined, so that 066h and 0F2h (or 0F3h) can be used in the same instruction, because this is meaning...