intrinsics

How to use MSVC intrinsics to get the equivalent of this GCC code?

The following code calls the builtin functions for clz/ctz in GCC and, on other systems, has C versions. Obviously, the C versions are a bit suboptimal if the system has a builtin clz/ctz instruction, like x86 and ARM. #ifdef __GNUC__ #define clz(x) __builtin_clz(x) #define ctz(x) __builtin_ctz(x) #else static uint32_t ALWAYS_INLINE po...
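A minimal sketch of how the two macros could be mapped onto MSVC, assuming 32-bit operands and a nonzero input (the result for zero is undefined for both the GCC builtins and the MSVC intrinsics). The names `clz32` and `ctz32` are hypothetical and are not taken from the question. `_BitScanReverse` and `_BitScanForward` from `<intrin.h>` return the index of the highest and lowest set bit respectively, so only the clz case needs converting.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>

/* MSVC: _BitScanReverse yields the index of the highest set bit. */
static uint32_t clz32(uint32_t x)
{
    unsigned long idx;
    _BitScanReverse(&idx, x);      /* result undefined when x == 0 */
    return 31u - (uint32_t)idx;
}

/* MSVC: _BitScanForward yields the index of the lowest set bit. */
static uint32_t ctz32(uint32_t x)
{
    unsigned long idx;
    _BitScanForward(&idx, x);      /* result undefined when x == 0 */
    return (uint32_t)idx;
}

#elif defined(__GNUC__)

/* GCC and compatible compilers: use the builtins directly. */
static uint32_t clz32(uint32_t x) { return (uint32_t)__builtin_clz(x); }
static uint32_t ctz32(uint32_t x) { return (uint32_t)__builtin_ctz(x); }

#else

/* Portable fallback: scan one bit at a time; returns 32 for x == 0. */
static uint32_t clz32(uint32_t x)
{
    uint32_t n;
    for (n = 0; n < 32 && !(x & 0x80000000u); n++)
        x <<= 1;
    return n;
}

static uint32_t ctz32(uint32_t x)
{
    uint32_t n;
    for (n = 0; n < 32 && !(x & 1u); n++)
        x >>= 1;
    return n;
}

#endif
```

With this layout, `clz32(1)` gives 31 and `ctz32(8)` gives 3 on any of the three paths.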

What is wrong with my version of _bittestandset?

I am new to assembly language. It seems that gcc doesn't have the _bittestandset function in intrin.h like MSVC does, so I implemented a new one. It works fine on Linux, but it goes wrong with MinGW on a Windows Vista machine. The code is: inline unsigned char _bittestandset(unsigned long * a, unsigned long b) { __asm__ ( "bts %1, %[b]" ...
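For comparison, here is a sketch of the same operation in plain C with no inline assembly. It is not a diagnosis of the truncated code above. The name `bit_test_and_set` is hypothetical, and the version shown is not atomic. It derives the word width from `unsigned long`, which matters because that type is 64 bits on x86-64 Linux and 32 bits under MinGW.

```c
#include <limits.h>

/* Returns the previous value of bit b in the bit array starting at a,
   then sets that bit. Not atomic. */
static unsigned char bit_test_and_set(unsigned long *a, unsigned long b)
{
    const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;
    unsigned long *word = a + b / bits;        /* word holding the bit   */
    const unsigned long mask = 1UL << (b % bits);
    const unsigned char old = (*word & mask) != 0;

    *word |= mask;
    return old;
}
```

Calling it twice on the same bit returns 0 the first time and 1 the second time.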

intrinsic memcmp

According to the gcc docs, memcmp is not an intrinsic function of GCC. If you wanted to speed up glibc's memcmp under gcc, you would need to use the lower level intrinsics defined in the docs. However, when searching around the internet, it seems that many people have the impression that memcmp is a builtin function. Is it for some compi...

Setting GCC 4.2.1 options in Xcode

Howdy, I have a few questions about Xcode and interaction with GCC 4.2.1: It doesn't seem as if Xcode Target Properties inspector exposes all possible GCC options. Is this correct? More specifically, I'm interested in setting the "mfpu" option, as mentioned in the arm_neon.h intrinsics header. Is this possible or supported? Or perhaps...

Dot product - SSE2 vs BLAS

What's my best bet for computing the dot product of a vector x with a large number of vectors y_i, where x and y_i are of length 10k or so? Shove the y's in a matrix and use an optimized s/dgemv routine? Or maybe try handcoding an SSE2 solution (I don't have SSE3, according to cpuinfo). I'm just looking for general guidance her...

Make compiler copy characters using movsd

I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time-critical function. The best code for this on the CPU side seems to be rep movsd. However, I cannot get my compiler to generate this code. I hoped (and I vaguely remember seeing so) using memcpy would do this using compiler bui...

How to quickly find maximal element of a sum of vectors?

I have the following code in the innermost loop of my program: struct V { float val [200]; // 0 <= val[i] <= 1 }; V a[600]; V b[250]; V c[250]; V d[350]; V e[350]; // ... init values in a,b,c,d,e ... int findmax(int ai, int bi, int ci, int di, int ei) { float best_val = 0.0; int best_ii = -1; for (int ii = 0; ii < 200; ii++) { ...

Using C intrinsics and memory alignment difficulties with classes

Ok, so I am just starting to use C intrinsics in my code and I have created a class, which simplified looks like this: class _Vector3D { public: _Vector3D() { aVals[0] = _mm_setzero_ps(); aVals[1] = _mm_setzero_ps(); aVals[2] = _mm_setzero_ps(); } ~_Vector3D() {} private: __m128 aVals[3]; }; So far so good. But when I create a sec...

C# fast CRC32 calculation

I've profiled my application with Ants and found that more than 10% is spent in CRC32 calculations. (The CRC32 calculation is done in plain C#.) I did some googling and learned about the following intrinsics in Visual Studio 2008: _mm_crc32_u8 _mm_crc32_u16 _mm_crc32_u32 _mm_crc32_u64 ( http://msdn.microsoft.com/en-us/library/bb514036.aspx )...

How do I replace __asm jno no_oflow with an intrinsic in a VS2008 64bit build?

I have this code: __asm jno no_oflow overflow = 1; __asm no_oflow: It produces this nice warning: error C4235: nonstandard extension used : '__asm' keyword not supported on this architecture What would be an equivalent/acceptable replacement for this code to check the overflow of a subtraction operation that happened before it? ...
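One way to avoid reading the overflow flag at all is to detect the overflow in C. The sketch below assumes the operands are 32-bit signed integers, which the excerpt does not state. The function name `sub_overflows_i32` is hypothetical. It relies on the fact that a - b overflows exactly when a and b have different signs and the sign of the result differs from the sign of a.

```c
#include <stdint.h>

/* Computes a - b with wraparound, stores it in *out, and returns 1 if
   the mathematically correct result does not fit in int32_t. */
static int sub_overflows_i32(int32_t a, int32_t b, int32_t *out)
{
    const uint32_t ua = (uint32_t)a;
    const uint32_t ub = (uint32_t)b;
    const uint32_t ur = ua - ub;   /* unsigned subtraction wraps, no UB */

    /* Conversion back to signed is implementation-defined; it is the
       two's complement value on MSVC and GCC. */
    *out = (int32_t)ur;

    /* Sign bit set in both (a ^ b) and (a ^ r) means overflow. */
    return (int)(((ua ^ ub) & (ua ^ ur)) >> 31);
}
```

The return value then plays the role of the `overflow` variable from the snippet, without any `__asm` block.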

x86 max/min asm instructions?

Are there any asm instructions that can speed up computation of the min/max of a vector of doubles/integers on the Core i7 architecture? Update: I didn't expect such rich answers, thank you. So I see that max/min can be done without branching. I have a sub-question: Is there an efficient way to get the index of the biggest double in an array? ...

VC++ SSE intrinsic optimisation weirdness

I am performing a scattered read of 8-bit data from a file (De-Interleaving a 64 channel wave file). I am then combining them to be a single stream of bytes. The problem I'm having is with my re-construction of the data to write out. Basically I'm reading in 16 bytes and then building them into a single __m128i variable and then using...

Equivalent of InterlockedIncrement in Linux/gcc

This is probably a very simple question (and possibly a duplicate), but I was unable to find an answer. The Win32 API provides a very handy set of atomic operations (as intrinsics) such as InterlockedIncrement, which emits lock add x86 code. Also, InterlockedCompareExchange is mapped to lock cmpxchg. But I want to do that in Linux with gcc. Since I'm worki...
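A minimal sketch using GCC's `__sync` builtins, which are available since GCC 4.1. The wrapper names `atomic_increment` and `atomic_compare_exchange` are hypothetical. The wrappers mirror the Win32 conventions: the increment returns the new value, and the compare-exchange takes (destination, exchange, comparand) and returns the value that was stored before the call.

```c
/* Like InterlockedIncrement: atomically adds 1 and returns the new value. */
static long atomic_increment(volatile long *p)
{
    return __sync_add_and_fetch(p, 1);
}

/* Like InterlockedCompareExchange: if *p == comparand, store exchange.
   Returns the value *p held before the operation either way. */
static long atomic_compare_exchange(volatile long *p, long exchange,
                                    long comparand)
{
    /* Note the argument order: the GCC builtin takes (ptr, old, new). */
    return __sync_val_compare_and_swap(p, comparand, exchange);
}
```

Newer GCC versions also offer the `__atomic` builtins and C11 `<stdatomic.h>`, which allow choosing a memory order; the `__sync` forms act as full barriers.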

How does _mm_mwait work?

Hello. How does _mm_mwait from pmmintrin.h work? (I mean not the asm for it, but the action it performs and how that action is carried out on NUMA systems. Store monitoring is easy to implement only on bus-based SMP systems with bus snooping.) Which processors implement it? Is it used in some spinlocks? ...

How do I reorder vector data using ARM Neon intrinsics?

This is specifically related to ARM Neon SIMD coding. I am using ARM Neon intrinsics for a certain module in a video decoder. I have vectorized data as follows: There are four 32-bit elements in a Neon register, say Q0, which is 128 bits wide. 3B 3A 1B 1A There are another four 32-bit elements in another Neon register, say Q1 ...

Intel AVX intrinsics: any compatibility library out there?

Is there an Intel AVX intrinsics compatibility library out there? I'm looking for something similar to the 'sse2mmx.h' header, which falls back to MMX intrinsics if SSE2 integer intrinsics are not available at compile time. Thus, if I had a similar library for AVX, I could write optimized code for new hardware which would have almost optimal speed in case AVX exte...

Does passing __m128i objects by reference to an inline function cause these objects to be moved to the stack?

Hello, I'm writing a transpose function for 8x16bit vectors with SSE2 intrinsics. Since there are 8 arguments for that function (a matrix of 8x8x16bit size), I can't do anything but pass them by reference. Will that be optimized by the compiler (I mean, will these __m128i objects be passed in registers instead of on the stack)? Code snippet: i...

What's the difference between logical SSE intrinsics?

Hello, Is there any difference between logical SSE intrinsics for different types? For example if we take OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128 all of which do the same thing: compute bitwise OR of their operands. My questions: Is there any difference between using one or another intrinsic (wi...

Is there a good reference for ARM Neon intrinsics?

The ARM reference manual doesn't go into much detail on the individual instructions ( http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/BABIIBBG.html ). Is there something that's a little more detailed? ...

Stack usage with MMX intrinsics and Microsoft C++

I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the MMX registers can accommodate 16 int32s to calculate 16 different cumulative sums in parallel. I would now like to convert this piece of code to MMX intrinsics but I am afraid that I wi...