I'd like to start playing with some AVX (Advanced Vector Extensions) instructions. I know Intel provides an emulator to test software containing these instructions (see this question), but since I don't want to manually write hex code, the question arises as to which assemblers currently know the AVX instruction set?
I would be most in...
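As a point of reference, here is a hedged sketch (not from the original post) of the kind of code the toolchain has to accept: with a sufficiently recent gcc and binutils it builds with -mavx, and the intrinsic compiles down to a single vaddps on ymm registers.

#include <immintrin.h>

__m256 add8(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);   /* assembled as vaddps ymm, ymm, ymm */
}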
There are a couple of places in my code base where the same operation is repeated a very large number of times over a large data set. In some cases it's taking a considerable time to process these.
I believe that using SSE to implement these loops should improve their performance significantly, especially where many operations are carried...
Hi there,
I am new to optimizing code with SSE/SSE2 instructions and until now I have not gotten very far. To my knowledge a common SSE-optimized function would look like this:
void sse_func(const float* const ptr, int len){
    if( ptr is aligned )
    {
        for( ... ){
            // unroll loop by 4 or 2 elements
        }
...
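For concreteness, a hedged sketch of that pattern with intrinsics; the function name and the summing operation are illustrative, not from the post. The idea is one alignment check up front, four floats per iteration, and a scalar tail.

#include <xmmintrin.h>
#include <cstdint>

float sse_sum(const float* const ptr, int len) {
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    const bool aligned = (reinterpret_cast<std::uintptr_t>(ptr) & 0xF) == 0;
    if (aligned) {
        for (; i + 4 <= len; i += 4)                     // aligned loads
            acc = _mm_add_ps(acc, _mm_load_ps(ptr + i));
    } else {
        for (; i + 4 <= len; i += 4)                     // unaligned loads
            acc = _mm_add_ps(acc, _mm_loadu_ps(ptr + i));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);                             // horizontal add of partial sums
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    for (; i < len; ++i) sum += ptr[i];                  // scalar tail
    return sum;
}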
In gcc, I want to do a 128-bit XOR of two C variables via asm code: how?
asm (
    "movdqa %1, %%xmm1;"
    "movdqa %0, %%xmm0;"
    "pxor %%xmm1, %%xmm0;"
    "movdqa %%xmm0, %0;"
    : "=x"(buff)            /* output operand */
    : "x"(bu), "x"(buff)
    : "%xmm0", "%xmm1"
);
But I get a segmentation fault.
This is the objdump out...
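A hedged intrinsics version of the same 128-bit XOR (bu and buff are the names from the post; their real types are unknown, so __m128i is an assumption). _mm_xor_si128 sidesteps one common cause of such faults, movdqa being applied to data that is not 16-byte aligned.

#include <emmintrin.h>

__m128i xor128(__m128i bu, __m128i buff) {
    return _mm_xor_si128(bu, buff);   /* compiles to a single pxor */
}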
I need some ideas on how to write a C++ cross-platform implementation of a few parallelizable problems in a way that lets me take advantage of SIMD (SSE, SPU, etc.) where available. I also want to be able to switch between SIMD and non-SIMD code paths at run time.
How would you suggest I approach this problem?
(Of course I don't want to implement th...
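A minimal runtime-dispatch sketch (all names hypothetical), assuming an x86 target built with SSE enabled and a gcc/clang that provides __builtin_cpu_supports; other compilers would query cpuid directly. The scalar and SIMD paths share one signature, and a function pointer is chosen once, on first use.

#include <cstddef>
#include <xmmintrin.h>

static void scale_scalar(float* p, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i) p[i] *= s;
}

static void scale_sse(float* p, std::size_t n, float s) {
    const __m128 vs = _mm_set1_ps(s);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)                          // 4 floats per step
        _mm_storeu_ps(p + i, _mm_mul_ps(_mm_loadu_ps(p + i), vs));
    for (; i < n; ++i) p[i] *= s;                       // scalar tail
}

static void scale(float* p, std::size_t n, float s) {
    static void (*impl)(float*, std::size_t, float) =
        __builtin_cpu_supports("sse") ? scale_sse : scale_scalar;
    impl(p, n, s);
}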
Hi!
Where can I find information about common SIMD tricks? I have an instruction set reference and know how to write non-tricky SIMD code, but I know SIMD is much more powerful than that; it can express complex conditional logic without branches.
For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of t...
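For comparison, a hedged x86 sketch of the same branchless trick: SSE2 already has a dedicated unsigned byte minimum (pminub), so the whole sequence collapses into one intrinsic.

#include <emmintrin.h>

static inline __m128i bytewise_min(__m128i a, __m128i b) {
    return _mm_min_epu8(a, b);   /* each byte: min(a[i], b[i]), no branches */
}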
Is there any faster method of storing two x86 32-bit registers in one 128-bit XMM register?
movd xmm0, edx
movd xmm1, eax
pshufd xmm0, xmm0, $1
por xmm0, xmm1
So if EAX is 0x12345678 and EDX is 0x87654321 the result in xmm0 must be 0x8765432112345678.
Thanks
...
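One hedged alternative to the pshufd/por sequence above, expressed with intrinsics: interleave the two dwords directly with punpckldq, which leaves EAX in the low dword and EDX in the next one, i.e. 0x8765432112345678 in the low 64 bits.

#include <emmintrin.h>

static inline __m128i pack_two_dwords(unsigned int eax, unsigned int edx) {
    __m128i lo = _mm_cvtsi32_si128((int)eax);   /* movd xmm0, eax */
    __m128i hi = _mm_cvtsi32_si128((int)edx);   /* movd xmm1, edx */
    return _mm_unpacklo_epi32(lo, hi);          /* punpckldq xmm0, xmm1 */
}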
What intrinsics would I use to vectorize the following (if it's even possible to vectorize) on x86_64?
double myNum = 0;
for (int i = 0; i < n; i++) {
    myNum += a[b[i]] * c[i];   // b[i] = int, a[b[i]] = double, c[i] = double
}
...
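A hedged SSE2 sketch of that loop (the function name is illustrative): c[i] can be loaded two at a time, but the indexed loads a[b[i]] have to be gathered manually with _mm_set_pd, since SSE provides no gather instruction, so the gain comes mainly from doing the multiply/add on pairs.

#include <emmintrin.h>

double indexed_dot(const double* a, const int* b, const double* c, int n) {
    __m128d acc = _mm_setzero_pd();
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_set_pd(a[b[i + 1]], a[b[i]]);  // manual "gather"
        __m128d vc = _mm_loadu_pd(c + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(va, vc));
    }
    double tmp[2];
    _mm_storeu_pd(tmp, acc);
    double myNum = tmp[0] + tmp[1];
    for (; i < n; ++i) myNum += a[b[i]] * c[i];         // odd tail element
    return myNum;
}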
I'm currently writing a very simple game engine for an assignment and to make the code a lot nicer I've decided to use a vector math library. One of my lecturers showed me the Sony Vector Math library which is used in the Bullet Physics engine and it's great as far as I can see. I've got it working on Linux nicely but I'm having problems...
I just noticed that our project has the "Enable Enhanced Instruction Set" flag left unset, probably just an oversight.
Before enabling the flag I would like to ask if anyone has seen any real-world performance improvements from enabling it?
I guess we will see some improvement, since our application constantly does floating point based c...
Is there a fast method for taking the modulus of a floating point number?
With integers, there are tricks for Mersenne primes, so that it's possible to calculate y = x MOD 2^31-1 without needing division (integer trick).
Can any similar tricks be applied for floating point numbers?
Preferably, in a way that can be converted into vect...
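A hedged SSE2 sketch of the usual division-free idea for floats, r = x - trunc(x/m)*m; the truncation here is done with a cvttpd/cvtdq round trip, which limits the quotient to int32 range and ignores fmod's exact corner cases, so it is a sketch rather than a drop-in fmod replacement.

#include <emmintrin.h>

static inline __m128d mod_pd(__m128d x, __m128d m) {
    __m128d q = _mm_div_pd(x, m);
    q = _mm_cvtepi32_pd(_mm_cvttpd_epi32(q));   /* trunc(x/m) via int32 round trip */
    return _mm_sub_pd(x, _mm_mul_pd(q, m));
}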
(I'm a newbie to SSE/asm, apologies if this is obvious or redundant)
Is there a better way to transpose 8 SSE registers containing 16-bit values than performing 24 unpck[lh]ps and 8/16+ shuffles, using 8 extra registers? (Note: using up to SSSE3 instructions, Intel Merom, i.e. lacking BLEND* from SSE4.)
Say you have registers v[0-7] ...
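For reference, a hedged sketch of the plain 3-stage unpack transpose for 8x8 16-bit values: 24 unpack instructions (word, dword, qword) and no extra shuffles. The in-place layout (v[r] holds row r on entry, column r on exit) is an assumption.

#include <emmintrin.h>

static void transpose8x8_epi16(__m128i v[8]) {
    /* stage 1: interleave 16-bit words of row pairs */
    __m128i a0 = _mm_unpacklo_epi16(v[0], v[1]), a4 = _mm_unpackhi_epi16(v[0], v[1]);
    __m128i a1 = _mm_unpacklo_epi16(v[2], v[3]), a5 = _mm_unpackhi_epi16(v[2], v[3]);
    __m128i a2 = _mm_unpacklo_epi16(v[4], v[5]), a6 = _mm_unpackhi_epi16(v[4], v[5]);
    __m128i a3 = _mm_unpacklo_epi16(v[6], v[7]), a7 = _mm_unpackhi_epi16(v[6], v[7]);
    /* stage 2: interleave 32-bit pairs */
    __m128i b0 = _mm_unpacklo_epi32(a0, a1), b1 = _mm_unpackhi_epi32(a0, a1);
    __m128i b2 = _mm_unpacklo_epi32(a2, a3), b3 = _mm_unpackhi_epi32(a2, a3);
    __m128i b4 = _mm_unpacklo_epi32(a4, a5), b5 = _mm_unpackhi_epi32(a4, a5);
    __m128i b6 = _mm_unpacklo_epi32(a6, a7), b7 = _mm_unpackhi_epi32(a6, a7);
    /* stage 3: interleave 64-bit halves */
    v[0] = _mm_unpacklo_epi64(b0, b2);  v[1] = _mm_unpackhi_epi64(b0, b2);
    v[2] = _mm_unpacklo_epi64(b1, b3);  v[3] = _mm_unpackhi_epi64(b1, b3);
    v[4] = _mm_unpacklo_epi64(b4, b6);  v[5] = _mm_unpackhi_epi64(b4, b6);
    v[6] = _mm_unpacklo_epi64(b5, b7);  v[7] = _mm_unpackhi_epi64(b5, b7);
}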
Hi,
I am trying to optimize the code below for SIMD operations (8-way/4-way/2-way SIMD, whichever is possible and gives a gain in performance). I am trying to analyze it first on paper to understand the algorithm used. How can I optimize it for SIMD?
void idct(uint8_t *dst, int stride, int16_t *input, int type)
{
    int16_t *ip = input...
I've done some inline ASM coding for SSE before and it was not too hard, even for someone who doesn't know ASM. But I note MS also provides intrinsics wrapping many such special instructions.
Is there a particular performance difference, or any other strong reason why one should be used above the other?
To repeat from the title, this is ...
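For concreteness, a hedged sketch of the intrinsic form of a 4-float add: the compiler emits the same addps one would write in an __asm block, but the code stays visible to the register allocator and optimizer, which is usually the practical argument for intrinsics.

#include <xmmintrin.h>

__m128 add4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);   /* compiles to addps */
}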
I'm working on a bit of code and I'm trying to optimize it as much as possible, basically get it running under a certain time limit.
The following makes the call...
static affinity_partitioner ap;
parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);
... and the following is what is executed.
void operator()(const blocked...
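A hedged sketch of the shape of the (truncated) body class; LoopBody and score are names from the post, everything inside operator() is illustrative.

#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct LoopBody {
    double* score;                                   /* type is an assumption */
    explicit LoopBody(double* s) : score(s) {}

    void operator()(const tbb::blocked_range<size_t>& r) const {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            score[i] = 0.0;                          /* per-element work here */
        }
    }
};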
I'm looking to optimize this linear search:
static int
linear (const int *arr, int n, int key)
{
    int i = 0;
    while (i < n) {
        if (arr [i] >= key)
            break;
        ++i;
    }
    return i;
}
The array is sorted and the function is supposed to return the index of the fi...
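A hedged SSE2 sketch of the same search: compare four elements at a time against key and use a byte mask to locate the first element that is >= key. __builtin_ctz is gcc/clang-specific (MSVC would use _BitScanForward).

#include <emmintrin.h>

static int
linear_sse2 (const int *arr, int n, int key)
{
    const __m128i vkey = _mm_set1_epi32 (key);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v  = _mm_loadu_si128 ((const __m128i *) (arr + i));
        __m128i lt = _mm_cmplt_epi32 (v, vkey);   /* arr[j] < key per element */
        int mask = _mm_movemask_epi8 (lt);        /* 4 mask bits per element  */
        if (mask != 0xFFFF)                       /* some element is >= key   */
            return i + __builtin_ctz (~mask) / 4;
    }
    for (; i < n; ++i)                            /* scalar tail              */
        if (arr [i] >= key)
            break;
    return i;
}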
I'm making a vector/matrix library for a game which utilizes the SIMD unit on the iPhone (3GS or later).
How can I do this?
I searched about this, now I know several options:
Accelerate framework (BLAS+LAPACK+...) from Apple (iPhone OS 4)
OpenMAX implementation library from ARM
GCC auto-vectorization feature
What's the most suitable way for v...
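For orientation, a hedged sketch of what the plain-GCC route looks like with NEON intrinsics (the 3GS's Cortex-A8 has NEON; build with -mfpu=neon). The function is illustrative, not from any of the listed libraries.

#include <arm_neon.h>

static inline void vec4_add(float out[4], const float a[4], const float b[4]) {
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));   /* 4 adds at once */
}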
I tried to follow:
Project > Properties > Configuration Properties > C/C++ > Code Generation > Enable Enhanced Instruction Set
But the only options I got were SSE or SSE2.
Thanks.
...
I'm writing a highly parallel, multithreaded application. I've already got an SSE-accelerated thread class written. If I were to write an MMX-accelerated thread class and then run both at the same time (one SSE thread and one MMX thread per core), would the performance improve noticeably?
I would think that this setup would help hide...
I have some code in a loop
for (int i = 0; i < n; i++)
{
    u[i] = c * u[i] + s * b[i];
}
So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?
UPDATE
I learnt vectorization (turns out it's not so hard if you use intrinsics) and ...
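A hedged SSE sketch of that loop with intrinsics, assuming float data (with doubles the same idea applies using the _pd variants, two elements per step): broadcast the two scalars once, then process four elements per iteration.

#include <xmmintrin.h>

void axpy(float* u, const float* b, int n, float c, float s) {
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vu = _mm_loadu_ps(u + i);
        __m128 vb = _mm_loadu_ps(b + i);
        vu = _mm_add_ps(_mm_mul_ps(vc, vu), _mm_mul_ps(vs, vb));
        _mm_storeu_ps(u + i, vu);
    }
    for (; i < n; ++i) u[i] = c * u[i] + s * b[i];   /* scalar tail */
}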