sse

What is the meaning of "non temporal" memory accesses in x86

This is a somewhat low-level question. In x86 assembly there are two SSE instructions: MOVDQA xmmi, m128 and MOVNTDQA xmmi, m128. The IA-32 Software Developer's Manual says that the NT in MOVNTDQA stands for Non-Temporal, and that otherwise it's the same as MOVDQA. My question is, what does Non-Temporal mean? ...

How much speed-up from converting 3D maths to SSE or other SIMD?

I am using 3D maths in my application extensively. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or a similar SIMD code? ...

How to get GCC to use more than two SIMD registers when using intrinsics?

I am writing some code and trying to speed it up using SSE2/3 SIMD intrinsics. My code is of such a nature that I need to load some data into an XMM register and act on it many times. When I look at the generated assembler code, it seems that GCC keeps flushing the data back to memory in order to reload something else in XMM0 an...

Difference in speed between char and integer arrays?

Hello all, I'm currently working on video processing software in which the picture data (8-bit signed and unsigned) is stored in arrays of 16-byte-aligned integers allocated as __declspec(align(16)) int *pData = (__declspec(align(16)) int *)_mm_malloc(width*height*sizeof(int),16); Generally, wouldn't it enable faster reading and writing...

Best resource for learning about prefetching a buffer in C on Intel/AMD 64 bit

I am interested in mastering prefetch-related functions such as _mm_prefetch(...) so that when I perform operations that loop over arrays, the memory bandwidth is fully utilized. What are the best resources for learning about this? I am doing this work in C using the GCC 4 series on an Intel Linux platform. ...

SSE4 instructions in VS2005?

I need to use the popcnt instruction in a project that is compiled using Visual Studio 2005. The intrinsic __popcnt() only works with VS2008, and the compiler doesn't seem to recognize the instruction even when I write it in an __asm {} block. Is there any way to do this? ...
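For readers in the same situation, one possible fallback when neither the intrinsic nor the instruction is available is a plain software bit count. This is only a minimal sketch of the well-known parallel bit-counting technique, not the hardware popcnt instruction; the function name popcount32 is hypothetical, and the code assumes unsigned int is 32 bits wide (VS2005 does not ship <stdint.h>).

```c
/* Hypothetical helper: counts the set bits in a 32-bit value in software.
   Assumes unsigned int is exactly 32 bits wide. */
unsigned int popcount32(unsigned int x)
{
    /* Count bits in each 2-bit group. */
    x = x - ((x >> 1) & 0x55555555u);
    /* Sum adjacent 2-bit counts into 4-bit groups. */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    /* Sum adjacent 4-bit counts into bytes. */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    /* Add the four byte counts together; the total lands in the top byte. */
    return (x * 0x01010101u) >> 24;
}
```

A call such as popcount32(0xF0u) would count the four set bits of that value. This is likely slower than the dedicated instruction, but it compiles on any C compiler.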

Is it possible to use SSE in C#?

I was reading a question about C# code optimization and one solution was to use C++ with SSE. Is it possible to do SSE directly from a C# program? ...

Developing for new instruction sets

Intel is set to release a new instruction set called AVX, which includes an extension of SSE to 256-bit operation. That is, either 4 double-precision elements or 8 single-precision elements. How would one go about developing code for AVX, considering there's no hardware out there that supports it yet? More generally, how can developer...

Using SSE instructions

I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND and if-else conditions. M...

glibc and SSE functionality

I am trying to find information on glibc and to what extent it uses SSE functionality. If it is optimized, can I use it out-of-the-box? Say I am using one of the larger Linux distros; I assume that its glibc is compiled to be as generic and as portable as possible, hence not optimized? I am particularly interested in ...

How do modern compilers use MMX/3DNow!/SSE instructions?

I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or occur from the series of instructions before it. How often do general-purp...

What's the most efficient way to multiply 4 floats by 4 floats using SSE?

I currently have the following code: float a[4] = { 10, 20, 30, 40 }; float b[4] = { 0.1, 0.1, 0.1, 0.1 }; asm volatile("movups (%0), %%xmm0\n\t" "mulps (%1), %%xmm0\n\t" "movups %%xmm0, (%1)" :: "r" (a), "r" (b)); I have first of all a few questions: (1) if i WERE to a...
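As a point of comparison for the inline-asm version in the question, the same element-wise multiply can be sketched with compiler intrinsics. This is a minimal sketch, not necessarily the most efficient approach: the function name mul4 and the HAVE_SSE macro are hypothetical, unaligned loads are assumed (mirroring the movups in the question), and a scalar fallback is included for builds where SSE is not enabled.

```c
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */
#define HAVE_SSE 1       /* hypothetical macro used only in this sketch */
#endif

/* Hypothetical helper: out[i] = a[i] * b[i] for i = 0..3.
   No alignment is assumed for any of the three pointers. */
void mul4(const float *a, const float *b, float *out)
{
#ifdef HAVE_SSE
    __m128 va = _mm_loadu_ps(a);            /* unaligned load, like movups */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb)); /* packed multiply, like mulps */
#else
    int i;
    for (i = 0; i < 4; ++i)                 /* scalar fallback without SSE */
        out[i] = a[i] * b[i];
#endif
}
```

With intrinsics the compiler chooses the registers itself, so no clobber list is needed. Passing b as both the second input and the output would reproduce the in-place store of the original snippet.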

Get GCC to preserve an SSE register throughout a function that uses inline asm

I'm writing a program in C that needs to do some fast math calculations. I'm using inline SSE assembly instructions to get some SIMD action (using packed double precision floating point numbers). I'm compiling using GCC on Linux. I'm in a situation where I need to loop over some data, and I use a constant factor in my calculations. I'd ...

How to determine SSE prefetch instruction size?

I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-byte prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU....

Resources for (Manual and Automatic) Loop Vectorization

I see some resources for gcc, but not for Visual Studio. Anyone have a treasure trove of references, examples and tricks? ...

Getting started with SSE

Hello, I want to learn more about using SSE. What ways are there to learn, besides the obvious one of reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals? I'm mainly interested in working with the GCC x86 built-in functions. ...

Calling constructor from another class

If I have a class like this: typedef union { __m128 quad; float numbers[4]; } Data; class foo { public: foo() : m_Data() {} Data m_Data; }; and a class like this: class bar { public: bar() : m_Data() {} foo m_Data; }; is foo's constructor called when making an instance of bar? Because when I try to use bar's m_Data...

SIMD programming languages

In the last couple of years, I've been doing a lot of SIMD programming and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on programming assembly to get to the really nifty stuff. However, up until now I've hardly been able to find any programming language with built-in suppor...

What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a barrier caused by a lack of understanding on my part. There's a bit more to it, but the simplest question I can come up with to answer is as follows: If I have the following code: float px[32768],py[32768],pz[32768]; float xref, yref, zref, delta...

What's a good place to start learning assembly?

I need to learn assembly using SSE instructions and need GCC to link the ASM code with C code. I have no idea where to start and Google hasn't helped. ...