I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!
I'm testing it with a loop something ...
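For reference, the two variants being compared might look roughly like the sketch below. The function names are hypothetical and the original test loop is not shown here. Note that `rsqrtss` is only an approximation (relative error up to about 1.5 * 2^-12), so the two functions do not return identical results, and the reciprocal version yields NaN for an input of 0.

```c
#include <xmmintrin.h>

/* Full-precision scalar square root via the native sqrtss instruction. */
static float sqrt_native(float x) {
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

/* Approximate square root computed as x * rsqrt(x).
   rsqrtss is an estimate, so the result is only good to roughly 11-12 bits.
   For x == 0 this computes 0 * inf, which is NaN. */
static float sqrt_via_rsqrt(float x) {
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_mul_ss(v, _mm_rsqrt_ss(v)));
}
```

Any speed comparison between the two should take that accuracy difference into account.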
I am in the following situation:
I am writing code for a kernel that does not allow SSE instructions
I need to do floating-point arithmetic
I'm compiling for an x86_64 platform
Here is a code sample that illustrates the problem:
int
main(int argc, char** argv)
{
double d = 0.0, dbase;
uint64_t base_value = 300;
d = (2200...
hello again
I am curious about the performance of Java numerical algorithms, for example double-precision matrix-matrix multiplication, using the latest JIT machines, compared with hand-tuned SSE C++/assembler or Fortran counterparts.
I have looked on the web but most of the results come from almost 10 years ago and I under...
Hello
What's the best way (SSE2) to reduce a _m128 (4 words a, b, c, d) to one word?
I want the low byte of each _m128 component:
int result = ( _m128.a & 0x000000ff ) << 24
| ( _m128.b & 0x000000ff ) << 16
| ( _m128.c & 0x000000ff ) << 8
| ( _m128.d & 0x000000ff ) << 0
Is there an intrinsic for that? than...
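One possible SSE2-only approach is to mask each lane and then narrow twice with the pack instructions. This is a sketch under assumptions: the value is held as an `__m128i`, `a` is the highest 32-bit lane and `d` the lowest, and `pack_low_bytes` is a hypothetical name.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Gather the low byte of each 32-bit lane into one 32-bit word.
   Lane 3 ends up in bits 31..24 and lane 0 in bits 7..0. */
static uint32_t pack_low_bytes(__m128i v) {
    v = _mm_and_si128(v, _mm_set1_epi32(0xff)); /* keep the low byte of each lane */
    v = _mm_packs_epi32(v, v);   /* 32 -> 16 bits; values <= 255 never saturate */
    v = _mm_packus_epi16(v, v);  /* 16 -> 8 bits with unsigned saturation */
    return (uint32_t)_mm_cvtsi128_si32(v);      /* low 4 bytes hold the result */
}
```

If `a` is actually the lowest lane, the lanes would need to be reversed first, for example with `_mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3))`.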
I'm benchmarking some SSE code (multiplying 4 floats by 4 floats) against traditional C code doing the same thing. I think my benchmark code must be incorrect in some way because it seems to say that the non-SSE code is faster than the SSE by a factor of 2-3.
Can someone tell me what is wrong with the benchmarking code below? And perhap...
Usually I work with 3D vectors using the following type:
typedef float vec3_t[3];
initializing vectors using something like:
vec3_t x_basis = {1.0, 0.0, 0.0};
vec3_t y_basis = {0.0, 1.0, 0.0};
vec3_t z_basis = {0.0, 0.0, 1.0};
and accessing them using something like:
x_basis[X] * y_basis[X] + ...
Now I need vector arithmetic using SSE i...
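As a rough illustration of what SSE arithmetic on such vectors could look like, here is a dot product that pads each 3-float vector with a zero fourth lane. The name `vec3_dot` and the padding strategy are assumptions for this sketch, not something taken from the question.

```c
#include <xmmintrin.h>

/* Dot product of two 3-component float vectors using SSE1 only. */
static float vec3_dot(const float a[3], const float b[3]) {
    /* _mm_set_ps takes lanes from highest to lowest; lane 3 is zero padding. */
    __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);
    __m128 m  = _mm_mul_ps(va, vb);          /* {ax*bx, ay*by, az*bz, 0} */
    __m128 hi = _mm_movehl_ps(m, m);         /* move lanes 2,3 down to 0,1 */
    __m128 s  = _mm_add_ps(m, hi);           /* lane0 = m0+m2, lane1 = m1+m3 */
    /* Add lane 1 into lane 0 to finish the horizontal sum. */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)));
    return _mm_cvtss_f32(s);
}
```

Storing vectors as four floats (16 bytes, aligned) would let the loads use `_mm_load_ps` directly instead of assembling lanes with `_mm_set_ps`.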
I am writing a graphics library in C and I would like to utilize SSE instructions to speed up some of the functions. How would I go about doing this? I am using the GCC compiler so I can rely on compiler intrinsics. I would also like to know whether I should change the way I am storing the image data (currently I am just using an array o...
I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse automatically in any project, and I wonder if I'm missing anything else that the x87 FPU offers.
...
Hi there,
I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this:
void sse_func(const float* const ptr, int len){
if( ptr is aligned )
{
for( ... ){
// unroll loop by 4 or 2 elements
}
...
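The aligned/unaligned structure outlined above could be filled in along these lines. The loop body is elided in the question, so the summation here is purely an assumed example operation, and `sse_sum` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* Sum len floats, using aligned loads when ptr is 16-byte aligned. */
static float sse_sum(const float *ptr, size_t len) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    if (((uintptr_t)ptr & 15u) == 0) {
        for (; i + 4 <= len; i += 4)          /* aligned path: movaps */
            acc = _mm_add_ps(acc, _mm_load_ps(ptr + i));
    } else {
        for (; i + 4 <= len; i += 4)          /* unaligned path: movups */
            acc = _mm_add_ps(acc, _mm_loadu_ps(ptr + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                /* spill the 4 partial sums */
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < len; ++i)                      /* scalar tail for leftovers */
        total += ptr[i];
    return total;
}
```

Because the partial sums are accumulated per lane, the rounding can differ slightly from a plain left-to-right scalar loop.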
Is there any book that teaches SSE starting with version 2? I couldn't find any and there aren't many tutorials/articles on the net.
Thanks in advance!
...
Our server application does a lot of integer tests in a hot code path, currently we use the following function:
inline int IsInteger(double n)
{
return n-floor(n) < 1e-8;
}
This function is very hot in our workload, so I want it to be as fast as possible. I also want to eliminate the "floor" library call if I can. Any suggestions?
...
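One way to avoid the `floor` call is a truncate-and-compare round trip through a 64-bit integer. This sketch deliberately changes the semantics: it tests for an exact integer and drops the 1e-8 tolerance of the original, it assumes x86_64 (for `_mm_cvttsd_si64`), and it is only meaningful for |n| < 2^63. `is_integer_exact` is a hypothetical name.

```c
#include <emmintrin.h>

/* Returns 1 if n is exactly an integer, 0 otherwise.
   Out-of-range values and NaN make cvttsd2si return INT64_MIN,
   which then fails the comparison, so they report 0. */
static inline int is_integer_exact(double n) {
    long long t = _mm_cvttsd_si64(_mm_set_sd(n)); /* truncate toward zero */
    return (double)t == n;                        /* exact round trip? */
}
```

Whether dropping the tolerance is acceptable depends on how the inputs are produced, which the question does not say.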
In GCC, I want to do a 128-bit XOR of two C variables via asm code. How?
asm (
"movdqa %1, %%xmm1;"
"movdqa %0, %%xmm0;"
"pxor %%xmm1,%%xmm0;"
"movdqa %%xmm0, %0;"
:"=x"(buff) /* output operand */
:"x"(bu), "x"(buff)
:"%xmm0","%xmm1"
);
but I get a segmentation fault;
this is the objdump out...
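As an alternative to inline asm, the same XOR can be written with SSE2 intrinsics, which leaves register allocation to the compiler. This is a sketch assuming the two variables are 16-byte buffers; `xor128` is a hypothetical name. It uses unaligned loads and stores because `movdqa` faults on addresses that are not 16-byte aligned, which is a common cause of crashes with this kind of code (the excerpt does not show whether that is the cause here).

```c
#include <emmintrin.h>
#include <stdint.h>

/* dst ^= src over 16 bytes; neither pointer needs to be aligned. */
static void xor128(uint8_t dst[16], const uint8_t src[16]) {
    __m128i a = _mm_loadu_si128((const __m128i *)dst);
    __m128i b = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(a, b)); /* pxor */
}
```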
I am performing a scattered read of 8-bit data from a file (De-Interleaving a 64 channel wave file). I am then combining them to be a single stream of bytes. The problem I'm having is with my re-construction of the data to write out.
Basically I'm reading in 16 bytes and then building them into a single __m128i variable and then using...
I'm trying to implement some inline assembler (in C/C++ code) to take advantage of SSE. I'd like to copy and duplicate values (from an XMM register, or from memory) to another XMM register. For example, suppose I have some values {1, 2, 3, 4} in memory. I'd like to copy these values such that xmm1 is populated with {1, 1, 1, 1}, xmm2 wit...
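With intrinsics instead of inline assembler, the broadcast described above is typically a `shufps` of a register with itself. A minimal sketch, assuming four floats in memory; `splat4` is a hypothetical name.

```c
#include <xmmintrin.h>

/* Load {in[0], in[1], in[2], in[3]} and produce four registers,
   out[k] holding in[k] replicated in all four lanes. */
static void splat4(const float in[4], __m128 out[4]) {
    __m128 v = _mm_loadu_ps(in);
    out[0] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    out[1] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    out[2] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    out[3] = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));
}
```

In assembler each line corresponds to a `movaps` copy followed by `shufps` with immediate 0x00, 0x55, 0xAA, or 0xFF.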
Hi!
Where can I find information about common SIMD tricks? I have an instruction set and know how to write straightforward SIMD code, but I know SIMD is now much more powerful: it can express complex conditional logic without branches.
For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of t...
Hi all,
GCC generates this code for the shuffle() below:
movaps xmm0,XMMWORD PTR [rip+0x125]
pshufb xmm4,xmm0
Ideally this should be:
pshufb xmm4,XMMWORD PTR [rip+0x125]
What is the extended ASM syntax to generate this single instruction?
Many thanks,
Adam
PS: The commented out intrinsic generates the optimal code for this examp...
Is there a faster way to store two x86 32-bit registers in one 128-bit xmm register?
movd xmm0, edx
movd xmm1, eax
pshufd xmm0, xmm0, $1
por xmm0, xmm1
So if EAX is 0x12345678 and EDX is 0x87654321 the result in xmm0 must be 0x8765432112345678.
Thanks
...
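Expressed with SSE2 intrinsics, one candidate is two `movd` instructions followed by a single `punpckldq`, which avoids the separate shuffle and OR. This is only a sketch; `pack_two_u32` is a hypothetical name, and whether it is actually faster would need to be measured.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Build an xmm value whose low 64 bits are hi:lo; the upper 64 bits are zero. */
static __m128i pack_two_u32(uint32_t lo, uint32_t hi) {
    __m128i vlo = _mm_cvtsi32_si128((int)lo);  /* movd: lane 0 = lo */
    __m128i vhi = _mm_cvtsi32_si128((int)hi);  /* movd: lane 0 = hi */
    return _mm_unpacklo_epi32(vlo, vhi);       /* punpckldq: {lo, hi, 0, 0} */
}
```

On x86_64, combining the two values in a 64-bit general-purpose register and using a single `movq` (`_mm_cvtsi64_si128`) is another option.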
What intrinsics would I use to vectorize the following (if it's even possible to vectorize) on x86_64?
double myNum = 0;
for(int i=0;i<n;i++){
myNum += a[b[i]] * c[i]; //b[i] = int, a[b[i]] = double, c[i] = double
}
...
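SSE2 has no gather instruction, so the indexed loads of `a[b[i]]` have to stay scalar; only the multiply and add can be done two at a time. A sketch under that assumption, with `indexed_dot` as a hypothetical name:

```c
#include <emmintrin.h>

/* Computes the sum over i of a[b[i]] * c[i], two elements per iteration. */
static double indexed_dot(const double *a, const int *b,
                          const double *c, int n) {
    __m128d acc = _mm_setzero_pd();
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        /* Scalar gathers packed into one register (arguments are high, low). */
        __m128d va = _mm_set_pd(a[b[i + 1]], a[b[i]]);
        __m128d vc = _mm_loadu_pd(c + i);      /* contiguous, so one load */
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vc));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];
    for (; i < n; ++i)                         /* odd leftover element */
        sum += a[b[i]] * c[i];
    return sum;
}
```

The summation order differs from the scalar loop, so results can differ in the last bits, and the gain may be small because the gathers dominate.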
Hi there,
I'm trying to make a program compiled with GCC and using Qt and SSE intrinsics.
It seems that when one of my functions is called by Qt, the stack alignment is not preserved. Here's a short example to illustrate what I mean:
#include <cstdio>
#include <emmintrin.h>
#include <QtGui/QApplication.h>
#include <QtGui/QWidget.h>
...
In SSE, the prefixes 066h (operand-size override), 0F2h (REPNE), and 0F3h (REPE) are part of the opcode.
In non-SSE code, 066h switches between 32-bit (or 64-bit) and 16-bit operation. 0F2h and 0F3h are used for string operations. They can be combined so that 066h and 0F2h (or 0F3h) can be used in the same instruction, because this is meaning...