This is a somewhat low-level question. In x86 assembly there are two SSE instructions:
MOVDQA xmmi, m128
and
MOVNTDQA xmmi, m128
The IA-32 Software Developer's Manual says that the NT in MOVNTDQA stands for Non-Temporal, and that otherwise it's the same as MOVDQA.
My question is, what does Non-Temporal mean?
...
I am using 3D maths extensively in my application. How much speed-up can I achieve by converting my vector/matrix library to SSE, AltiVec or similar SIMD code?
...
I am writing some code and trying to speed it up using SSE2/3 SIMD intrinsics. My code is of such a nature that I need to load some data into an XMM register and act on it many times. When I look at the generated assembler code, it seems that GCC keeps flushing the data back to memory, in order to reload something else in XMM0 an...
Hello all,
I'm currently working on video processing software in which the picture data (8-bit signed and unsigned) is stored in arrays of 16-byte-aligned integers allocated as
__declspec(align(16)) int *pData = (__declspec(align(16)) int *)_mm_malloc(width*height*sizeof(int),16);
Generally, wouldn't it enable faster reading and writing...
I am interested in mastering prefetch-related functions such as
_mm_prefetch(...)
so that when I perform operations that loop over arrays, the memory bandwidth is fully utilized. What are the best resources for learning about this?
I am doing this work in C using the GCC 4 series on an Intel Linux platform.
...
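As a rough illustration of the kind of loop this question has in mind, here is a minimal sketch using GCC's `__builtin_prefetch`. The function name `sum_with_prefetch` and the `PREFETCH_DISTANCE` value are assumptions made up for this example, not taken from the question. A plain linear scan like this is often handled well by the hardware prefetcher already, so any benefit would have to be measured on the target CPU.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead of the current
   index to request. The best value depends on the CPU and the loop. */
#define PREFETCH_DISTANCE 64

/* Sums an array while hinting that data further ahead will be read soon. */
static double sum_with_prefetch(const double *data, size_t n)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; ++i) {
        /* rw = 0 (read), locality = 0 (no need to keep it in cache). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 0);
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint: it does not change the result of the loop, only (possibly) its timing.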
I need to use the popcnt instruction in a project that is compiled using Visual Studio 2005.
The intrinsic __popcnt() only works with VS2008, and the compiler doesn't seem to recognize the instruction even when I write it in an __asm {} block.
Is there any way to do this?
...
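One possible workaround, when neither the intrinsic nor the instruction is available, is to count bits in software. The sketch below is not the hardware POPCNT instruction; `popcount32_sw` is a hypothetical name, and it assumes `unsigned int` is exactly 32 bits wide, which holds for Visual Studio and GCC on x86.

```c
/* Hypothetical helper: counts the set bits of a 32-bit word without the
   POPCNT instruction, using the classic parallel (SWAR) reduction.
   Assumes unsigned int is exactly 32 bits wide. */
static unsigned int popcount32_sw(unsigned int x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* sums of 2 bits */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* sums of 4 bits */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* sums of 8 bits */
    return (x * 0x01010101u) >> 24;                   /* add the 4 bytes */
}
```

This uses only shifts, masks and one multiply, so it compiles on any C compiler, at the cost of being slower than the single instruction.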
I was reading a question about C# code optimization and one solution was to use C++ with SSE. Is it possible to use SSE directly from a C# program?
...
Intel is set to release a new instruction set called AVX, which includes an extension of SSE to 256-bit operation. That is, either 4 double-precision elements or 8 single-precision elements.
How would one go about developing code for AVX, considering there's no hardware out there that supports it yet? More generally, how can developer...
I have a loop written in C++ which is executed for each element of a big integer array. Inside the loop, I mask some bits of the integer and then find the min and max values. I heard that if I use SSE instructions for these operations it will run much faster compared to a normal loop written using bitwise AND and if-else conditions. M...
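To make the idea concrete, here is a minimal sketch of such a loop with SSE2 intrinsics. It assumes signed 32-bit `int` elements and a single constant mask, neither of which the excerpt confirms, and `masked_min_max` is a hypothetical name. SSE2 has no packed 32-bit min/max (that arrived with SSE4.1), so the sketch selects with a compare and bitwise blend instead. When `__SSE2__` is not defined, only the scalar loop runs.

```c
#include <stddef.h>
#include <limits.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: ANDs every element with `mask`, then reports the
   smallest and largest masked values (signed comparison). For n == 0
   it returns INT_MAX / INT_MIN. */
static void masked_min_max(const int *v, size_t n, int mask,
                           int *out_min, int *out_max)
{
    int lo = INT_MAX, hi = INT_MIN;
    size_t i = 0;
#if defined(__SSE2__)
    __m128i vmask = _mm_set1_epi32(mask);
    __m128i vlo = _mm_set1_epi32(INT_MAX);
    __m128i vhi = _mm_set1_epi32(INT_MIN);
    int tmp_lo[4], tmp_hi[4];
    int k;

    /* Process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_and_si128(
            _mm_loadu_si128((const __m128i *)(v + i)), vmask);
        __m128i lt = _mm_cmplt_epi32(x, vlo); /* lanes where x < vlo */
        __m128i gt = _mm_cmpgt_epi32(x, vhi); /* lanes where x > vhi */
        /* Blend: take x where the compare is true, else keep the old value. */
        vlo = _mm_or_si128(_mm_and_si128(lt, x), _mm_andnot_si128(lt, vlo));
        vhi = _mm_or_si128(_mm_and_si128(gt, x), _mm_andnot_si128(gt, vhi));
    }

    /* Reduce the four lanes to scalars. */
    _mm_storeu_si128((__m128i *)tmp_lo, vlo);
    _mm_storeu_si128((__m128i *)tmp_hi, vhi);
    for (k = 0; k < 4; ++k) {
        if (tmp_lo[k] < lo) lo = tmp_lo[k];
        if (tmp_hi[k] > hi) hi = tmp_hi[k];
    }
#endif
    /* Scalar loop: handles the tail, or everything without SSE2. */
    for (; i < n; ++i) {
        int x = v[i] & mask;
        if (x < lo) lo = x;
        if (x > hi) hi = x;
    }
    *out_min = lo;
    *out_max = hi;
}
```

Whether this beats the scalar loop depends on the compiler and data size; modern compilers may auto-vectorize the scalar version on their own, so it is worth timing both.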
I am trying to find information on glibc and to what extent it uses SSE functionality.
If it is optimized, can I use it out-of-the-box?
Say I am using one of the larger Linux distros, I assume that its glibc is compiled to be as generic as possible and to be as portable as possible, hence not optimized?
I am particularly interested in ...
I've been reading up on the x86 instruction set extensions, and they only seem useful in some quite specific circumstances (e.g. HADDPD (Horizontal-Add-Packed-Double) in SSE3). These require a certain register layout that needs to be either deliberately set up, or to result from the series of instructions before it. How often do general-purp...
I currently have the following code:
float a[4] = { 10, 20, 30, 40 };
float b[4] = { 0.1, 0.1, 0.1, 0.1 };
asm volatile("movups (%0), %%xmm0\n\t"
"mulps (%1), %%xmm0\n\t"
"movups %%xmm0, (%1)"
:: "r" (a), "r" (b));
I have first of all a few questions:
(1) if i WERE to a...
I'm writing a program in C that needs to do some fast math calculations. I'm using inline SSE assembly instructions to get some SIMD action (using packed double precision floating point numbers). I'm compiling using GCC on Linux.
I'm in a situation where I need to loop over some data, and I use a constant factor in my calculations. I'd ...
I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-byte prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU....
I see some resources for gcc, but not for Visual Studio.
Anyone have a treasure trove of references, examples and tricks?
...
Hello, I want to learn more about using SSE.
What ways are there to learn, besides the obvious one of reading the Intel® 64 and IA-32 Architectures Software Developer's Manuals?
I'm mainly interested in working with the GCC x86 built-in functions.
...
If I have a class like this:
typedef union { __m128 quad; float numbers[4]; } Data;
class foo
{
public:
foo() : m_Data() {}
Data m_Data;
};
and a class like this:
class bar
{
public:
bar() : m_Data() {}
foo m_Data;
};
is foo's constructor called when making an instance of bar?
Because when I try to use bar's m_Data...
In the last couple of years, I've been doing a lot of SIMD programming and most of the time I've been relying on compiler intrinsic functions (such as the ones for SSE programming) or on programming assembly to get to the really nifty stuff. However, up until now I've hardly been able to find any programming language with built-in suppor...
I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a barrier caused by a lack of understanding on my part. There's a bit more to it, but the simplest question I can come up with is as follows:
If I have the following code:
float px[32768],py[32768],pz[32768];
float xref, yref, zref, delta...
I need to learn assembly using SSE instructions, and I need GCC to link the assembly code with C code.
I have no idea where to start and Google hasn't helped.
...