simd

Which assemblers currently support the AVX instruction set?

I'd like to start playing with some AVX (Advanced Vector Extensions) instructions. I know Intel provides an emulator to test software containing these instructions (see this question), but since I don't want to write hex code by hand, the question arises as to which assemblers currently know the AVX instruction set? I would be most in...

Taking advantage of SSE and other CPU extensions.

There are a couple of places in my code base where the same operation is repeated a very large number of times for a large data set. In some cases it's taking a considerable time to process these. I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried...

How to determine if memory is aligned? ( *testing* for alignment, not aligning )

Hi there, I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this: void sse_func(const float* const ptr, int len){ if( ptr is aligned ) { for( ... ){ // unroll loop by 4 or 2 elements } ...
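A minimal sketch of the alignment test the excerpt asks about, assuming the usual approach of converting the pointer to an integer and masking its low bits. The helper name `is_aligned` is hypothetical and not taken from the question; the check assumes `alignment` is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of `alignment` bytes.
 * `alignment` must be a power of two (16 for aligned SSE loads/stores). */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (uintptr_t)(alignment - 1)) == 0;
}
```

With a helper like this, the `if( ptr is aligned )` branch in the excerpt could read `if (is_aligned(ptr, 16))`.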

how to work with 128 bits C variable and xmm 128 bits asm?

In GCC, I want to do a 128-bit XOR of two C variables via asm code. How? asm ( "movdqa %1, %%xmm1;" "movdqa %0, %%xmm0;" "pxor %%xmm1,%%xmm0;" "movdqa %%xmm0, %0;" :"=x"(buff) /* output operand */ :"x"(bu), "x"(buff) :"%xmm0","%xmm1" ); But I get a segmentation fault; this is the objdump out...
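For comparison, here is a sketch of the same 128-bit XOR written with SSE2 intrinsics instead of inline asm. This is a swapped-in technique, not a fix for the asm in the excerpt. The function name `xor128` and the byte-array representation of the operands are assumptions; unaligned loads and stores are used so the buffers need no particular alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: dst ^= src over 16 bytes.
 * Uses unaligned loads/stores, so neither buffer has to be 16-byte aligned. */
static void xor128(uint8_t dst[16], const uint8_t src[16])
{
    __m128i a = _mm_loadu_si128((const __m128i *)dst);
    __m128i b = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(a, b));
}
```

Letting the compiler allocate the xmm registers avoids hand-written constraints and clobber lists.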

SIMD or not SIMD - cross platform

I need some ideas on how to write a cross-platform C++ implementation of a few parallelizable problems so that I can take advantage of SIMD (SSE, SPU, etc.) if available. I also want to be able to switch between SIMD and non-SIMD at run time. How would you suggest I approach this problem? (Of course I don't want to implement th...

Common SIMD techniques

Hi! Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know SIMD is now much more powerful. It can express complex conditional code without branches. For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of t...

Storing two x86 32 bit registers into 128 bit xmm register

Is there any faster method to store two x86 32 bit registers in one 128 bit xmm register? movd xmm0, edx movd xmm1, eax pshufd xmm0, xmm0, $1 por xmm0, xmm1 So if EAX is 0x12345678 and EDX is 0x87654321 the result in xmm0 must be 0x8765432112345678. Thanks ...
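One possible shorter sequence, sketched with SSE2 intrinsics: move each value into the low lane of its own register and interleave them, which typically maps to `movd`, `movd`, `punpckldq`. The function name `pack_two_u32` is hypothetical, and whether this is actually faster on a given CPU is not something this sketch measures.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: returns an xmm value whose low 64 bits are hi:lo
 * and whose upper 64 bits are zero. */
static __m128i pack_two_u32(uint32_t lo, uint32_t hi)
{
    __m128i vlo = _mm_cvtsi32_si128((int)lo);  /* lanes: lo, 0, 0, 0 */
    __m128i vhi = _mm_cvtsi32_si128((int)hi);  /* lanes: hi, 0, 0, 0 */
    return _mm_unpacklo_epi32(vlo, vhi);       /* lanes: lo, hi, 0, 0 */
}
```

With `lo = 0x12345678` (EAX) and `hi = 0x87654321` (EDX), the low quadword of the result is 0x8765432112345678, matching the example in the question.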

Is it possible to vectorize myNum += a[b[i]] * c[i]; on x86_64?

What intrinsics would I use to vectorize the following (if it's even possible to vectorize) on x86_64? double myNum = 0; for(int i=0;i<n;i++){ myNum += a[b[i]] * c[i]; //b[i] = int, a[b[i]] = double, c[i] = double } ...
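A hedged sketch of one way to approach this with SSE2: since SSE2 has no gather instruction, the indexed loads of `a[b[i]]` stay scalar and are packed by hand, while the multiply and accumulate run two doubles at a time. The function name `indexed_dot` is hypothetical. Note that the summation order differs from the scalar loop, so results can differ in the last bits for general inputs.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: computes sum of a[b[i]] * c[i] for i in [0, n). */
static double indexed_dot(const double *a, const int *b, const double *c, int n)
{
    __m128d acc = _mm_setzero_pd();
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        /* Manual gather: _mm_set_pd takes (high, low). */
        __m128d va = _mm_set_pd(a[b[i + 1]], a[b[i]]);
        __m128d vc = _mm_loadu_pd(c + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vc));
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];
    for (; i < n; i++)          /* scalar tail for odd n */
        sum += a[b[i]] * c[i];
    return sum;
}
```

Because the gather remains scalar, any speedup depends on how much of the loop time is spent in the arithmetic rather than in the indexed loads.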

SIMD Sony Vector Math Library in OS X with C++

I'm currently writing a very simple game engine for an assignment and to make the code a lot nicer I've decided to use a vector math library. One of my lecturers showed me the Sony Vector Math library which is used in the Bullet Physics engine and it's great as far as I can see. I've got it working on Linux nicely but I'm having problems...

Visual studio compiler flag /arch and performance

I just noticed that our project has left the "Enable Enhanced Instruction Set" flag unset, probably just an oversight. Before enabling the flag I would like to ask if anyone has seen any real-world performance improvements from enabling it? I guess we will see some improvement, as our application constantly does floating point based c...

Fast, Vectorizable method of taking floating point number modulus of special primes?

Is there a fast method for taking the modulus of a floating point number? With integers, there are tricks for Mersenne primes, so that it's possible to calculate y = x MOD 2^31-1 without needing division. integer trick Can any similar tricks be applied for floating point numbers? Preferably, in a way that can be converted into vect...
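For reference, a sketch of the integer trick the excerpt alludes to, assuming the standard folding identity 2^31 ≡ 1 (mod 2^31-1): the bits above position 31 are shifted down and added to the low 31 bits. The function name `mod_mersenne31` is hypothetical, and this covers only the integer case, not the floating point variant being asked about.

```c
#include <stdint.h>

/* Hypothetical helper: x mod (2^31 - 1) without a division. */
static uint32_t mod_mersenne31(uint64_t x)
{
    const uint64_t M = 0x7FFFFFFFu;   /* 2^31 - 1 */
    x = (x & M) + (x >> 31);          /* first fold: now below 2^33 + 2^31 */
    x = (x & M) + (x >> 31);          /* second fold: now at most M + 4 */
    if (x >= M)                       /* one conditional subtract finishes */
        x -= M;
    return (uint32_t)x;
}
```

Two folds are enough for any 64-bit input; a 32-bit input would need only one fold plus the conditional subtract.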

transpose for 8 registers of 16-bit elements on SSE2/SSSE3

(I'm a newbie to SSE/asm, apologies if this is obvious or redundant) Is there a better way to transpose 8 SSE registers containing 16-bit values than performing 24 unpck[lh]ps and 8/16+ shuffles and using 8 extra registers? (Note using up to SSSE 3 instructions, Intel Merom, aka lacking BLEND* from SSE4.) Say you have registers v[0-7] ...

What approach to take for SIMD optimizations

Hi, I am trying to optimize the code below for SIMD operations (8-way/4-way/2-way SIMD, whichever is possible and gives a performance gain). I am trying to analyze it on paper first to understand the algorithm used. How can I optimize it for SIMD? void idct(uint8_t *dst, int stride, int16_t *input, int type) { int16_t *ip = input...

Intrinsics Vs inline ASM for SSE coding in VC++ 2K8

I've done some inline ASM coding for SSE before and it was not too hard even for someone who doesn't know ASM. But I note MS also provide intrinsics wrapping many such special instructions. Is there a particular performance difference, or any other strong reason why one should be used above the other? To repeat from the title, this is ...

implement SIMD in C++

I'm working on a bit of code and I'm trying to optimize it as much as possible, basically get it running under a certain time limit. The following makes the call... static affinity_partitioner ap; parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap); ... and the following is what is executed. void operator()(const blocked...

How fast can you make linear search?

I'm looking to optimize this linear search: static int linear (const int *arr, int n, int key) { int i = 0; while (i < n) { if (arr [i] >= key) break; ++i; } return i; } The array is sorted and the function is supposed to return the index of the fi...
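One possible direction, sketched with SSE2 intrinsics: compare four elements per iteration against the key and skip the whole block while all four are smaller, then finish with the original scalar loop. The function name `linear_sse2` is hypothetical, and no timing claim is made here; it returns the same index as the scalar version in the excerpt.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical SSE2 variant: index of the first element >= key, or n. */
static int linear_sse2(const int *arr, int n, int key)
{
    const __m128i vkey = _mm_set1_epi32(key);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(arr + i));
        /* Lanes where arr[i+k] < key become all-ones; collect their sign bits. */
        int lt = _mm_movemask_ps(_mm_castsi128_ps(_mm_cmplt_epi32(v, vkey)));
        if (lt != 0xF)
            break;  /* this block holds an element >= key */
    }
    while (i < n && arr[i] < key)  /* scalar scan within the block and the tail */
        ++i;
    return i;
}
```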

What's the right way to utilize ARM SIMD on iPhone for Game vector/matrix operation?

I'm making a vector/matrix library for games that uses the SIMD unit on the iPhone (3GS or later). How can I do this? I searched and found several options: the Accelerate framework (BLAS+LAPACK+...) from Apple (iPhone OS 4); the OpenMAX implementation library from ARM; GCC's auto-vectorization feature. What's the most suitable way for v...

How do I enable the SSE3/SSE4.1 instruction set in Visual Studio 2008?

I tried to follow: Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set But the only options I got were - SSE or SSE2. Thanks. ...

Benefit of using multiple SIMD instruction sets simultaneously

I'm writing a highly parallel application that's multithreaded. I've already got an SSE accelerated thread class written. If I were to write an MMX accelerated thread class, then run both at the same time (one SSE thread and one MMX thread per core) would the performance improve noticeably? I would think that this setup would help hide...

SSE SIMD Optimization For Loop

I have some code in a loop for(int i = 0; i < n; i++) { u[i] = c * u[i] + s * b[i]; } So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup? UPDATE I learnt vectorization (turns out it's not so hard if you use intrinsics) and ...
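A minimal sketch of how this loop might look with SSE intrinsics, assuming `u` and `b` are `float` arrays (the excerpt does not state the element type) and that no alignment is guaranteed. The function name `scale_add` is hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: u[i] = c * u[i] + s * b[i] for i in [0, n). */
static void scale_add(float *u, const float *b, float c, float s, size_t n)
{
    const __m128 vc = _mm_set1_ps(c);  /* broadcast c to all four lanes */
    const __m128 vs = _mm_set1_ps(s);  /* broadcast s to all four lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vu = _mm_loadu_ps(u + i);
        __m128 vb = _mm_loadu_ps(b + i);
        vu = _mm_add_ps(_mm_mul_ps(vc, vu), _mm_mul_ps(vs, vb));
        _mm_storeu_ps(u + i, vu);
    }
    for (; i < n; i++)                 /* scalar tail when n is not a multiple of 4 */
        u[i] = c * u[i] + s * b[i];
}
```

If both arrays are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could replace the unaligned variants.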