The following code calls the builtin functions for clz/ctz in GCC and falls back to C versions on other compilers. Obviously, the C versions are suboptimal if the target has a native clz/ctz instruction, as x86 and ARM do.
#ifdef __GNUC__
#define clz(x) __builtin_clz(x)
#define ctz(x) __builtin_ctz(x)
#else
static uint32_t ALWAYS_INLINE po...
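The fallback in the snippet above is cut off, so the following is only a sketch of what a portable C version could look like, not the original author's code. The names clz32_portable and ctz32_portable are hypothetical, and the sketch assumes 32-bit inputs. Like __builtin_clz and __builtin_ctz, it treats a zero argument as invalid.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fallback: count leading zero bits by binary search.
   The argument must be nonzero, mirroring __builtin_clz. */
static unsigned clz32_portable(uint32_t x)
{
    assert(x != 0);
    unsigned n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8; }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4; }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2; }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

/* Hypothetical fallback: count trailing zero bits by binary search.
   The argument must be nonzero, mirroring __builtin_ctz. */
static unsigned ctz32_portable(uint32_t x)
{
    assert(x != 0);
    unsigned n = 0;
    if ((x & 0x0000FFFFu) == 0) { n += 16; x >>= 16; }
    if ((x & 0x000000FFu) == 0) { n += 8;  x >>= 8; }
    if ((x & 0x0000000Fu) == 0) { n += 4;  x >>= 4; }
    if ((x & 0x00000003u) == 0) { n += 2;  x >>= 2; }
    if ((x & 0x00000001u) == 0) { n += 1; }
    return n;
}
```

Each function needs at most five comparisons, which is still far slower than a single hardware instruction where one is available.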
I am new to assembly language. It seems that GCC doesn't have a _bittestandset function in intrin.h like MSVC does, so I implemented one myself. It works fine on Linux, but it goes wrong with MinGW on a Windows Vista machine. The code is:
inline unsigned char _bittestandset(unsigned long * a, unsigned long b)
{
__asm__ ( "bts %1, %[b]"
...
According to the gcc docs, memcmp is not an intrinsic function of GCC. If you wanted to speed up glibc's memcmp under gcc, you would need to use the lower level intrinsics defined in the docs. However, when searching around the internet, it seems that many people have the impression that memcmp is a builtin function. Is it for some compi...
Howdy,
I have a few questions about Xcode and interaction with GCC 4.2.1:
It doesn't seem as if the Xcode Target Properties inspector exposes all possible GCC options. Is this correct?
More specifically, I'm interested in setting the "mfpu" option, as mentioned in the arm_neon.h intrinsics header. Is this possible or supported? Or perhaps...
What's my best bet for computing the dot product of a vector x with a large number of vectors y_i, where x and y_i are of length 10k or so?
Shove the y's in a matrix and use an optimized s/dgemv routine?
Or maybe try handcoding an SSE2 solution (I don't have SSE3, according to cpuinfo).
I'm just looking for general guidance her...
I would like to copy a relatively short sequence of memory (less than 1 KB, typically 2-200 bytes) in a time-critical function. The best code for this on the CPU side seems to be rep movsd. However, I somehow cannot make my compiler generate this code. I hoped (and I vaguely remember seeing so) using memcpy would do this using compiler bui...
I have the following code in the innermost loop of my program:
struct V {
float val [200]; // 0 <= val[i] <= 1
};
V a[600];
V b[250];
V c[250];
V d[350];
V e[350];
// ... init values in a,b,c,d,e ...
int findmax(int ai, int bi, int ci, int di, int ei) {
float best_val = 0.0;
int best_ii = -1;
for (int ii = 0; ii < 200; ii++) {
...
OK, so I am just starting to use C intrinsics in my code and I have created a class which, simplified, looks like this:
class _Vector3D
{
public:
_Vector3D()
{
aVals[0] = _mm_setzero_ps();
aVals[1] = _mm_setzero_ps();
aVals[2] = _mm_setzero_ps();
}
~_Vector3D() {}
private:
__m128 aVals[3];
};
So far so good. But when I create a sec...
I've profiled my application with ANTS and found out that more than 10% of the time is spent in CRC32 calculations.
(The CRC32 calculation is done in plain C#.)
I did some googling and learned about the following intrinsics in Visual Studio 2008:
_mm_crc32_u8
_mm_crc32_u16
_mm_crc32_u32
_mm_crc32_u64
( http://msdn.microsoft.com/en-us/library/bb514036.aspx )...
I have this code:
__asm jno no_oflow
overflow = 1;
__asm no_oflow:
It produces this nice warning:
error C4235: nonstandard extension used : '__asm' keyword not supported on this architecture
What would be an equivalent/acceptable replacement for this code to check the overflow of a subtraction operation that happened before it?
...
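One way to avoid inline assembly here is to test for overflow before doing the subtraction, using only standard C. The sketch below is not taken from the question or its answers. It assumes the operands are plain int, and sub_overflows is a hypothetical helper name.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: returns true if a - b would overflow an int.
   The check itself never overflows, because INT_MIN + b is only
   evaluated for b > 0 and INT_MAX + b only for b < 0. */
static bool sub_overflows(int a, int b)
{
    return (b > 0 && a < INT_MIN + b) ||
           (b < 0 && a > INT_MAX + b);
}
```

A caller would set the overflow flag from sub_overflows(a, b) and only then compute a - b. GCC and Clang also offer __builtin_sub_overflow, but that builtin is not available in MSVC.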
Are there any asm instructions that can speed up computation of the min/max of a vector of doubles/integers on the Core i7 architecture?
Update:
I didn't expect such rich answers, thank you.
So I see that it is possible to compute max/min without branching.
I have a sub-question:
Is there an efficient way to get the index of the biggest double in an array?
...
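As a point of comparison for any vectorized answer, a plain scalar baseline for the index sub-question might look like the sketch below. It is not from the question. The name argmax_double is hypothetical, and the array is assumed to be non-empty and free of NaNs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: index of the largest element of v[0..n-1].
   On ties the first occurrence wins. Assumes n > 0 and no NaNs
   (a NaN never compares greater, so it would simply be skipped
   unless it sits at index 0). */
static size_t argmax_double(const double *v, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (v[i] > v[best]) {
            best = i;
        }
    }
    return best;
}
```

Whether the compiler turns the inner comparison into a conditional move or a branch depends on the compiler and its flags, so this is only a reference for correctness and timing.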
I am performing a scattered read of 8-bit data from a file (de-interleaving a 64-channel wave file). I am then combining the data into a single stream of bytes. The problem I'm having is with my reconstruction of the data to write out.
Basically I'm reading in 16 bytes and then building them into a single __m128i variable and then using...
This is probably a very simple question (and possibly a duplicate), but I was unable to find an answer.
Win32 API provides a very handy set of atomic operations (as intrinsics) such as InterlockedIncrement which emits lock add x86 code. Also, InterlockedCompareExchange is mapped to lock cmpxchg.
But, I want to do that in Linux with gcc. Since I'm worki...
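For reference, GCC (4.1 and later) provides the __sync family of atomic builtins, which cover the same ground. The sketch below wraps two of them. The wrapper names interlocked_increment and interlocked_compare_exchange are hypothetical and merely modeled on the Win32 calls, and the operands are assumed to be long.

```c
/* Hypothetical wrappers around GCC's __sync atomic builtins. */

/* Atomically adds 1 to *p and returns the new value,
   like Win32 InterlockedIncrement. */
static long interlocked_increment(volatile long *p)
{
    return __sync_add_and_fetch(p, 1);
}

/* If *dest equals comparand, atomically stores exchange into *dest.
   Returns the value *dest held before the call, like Win32
   InterlockedCompareExchange. Note the swapped argument order:
   the builtin takes (ptr, oldval, newval). */
static long interlocked_compare_exchange(volatile long *dest,
                                         long exchange,
                                         long comparand)
{
    return __sync_val_compare_and_swap(dest, comparand, exchange);
}
```

Newer GCC releases also provide the __atomic builtins and C11 <stdatomic.h>, which let the caller pick a memory ordering. The __sync builtins always act as full barriers.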
Hello
How does _mm_mwait from pmmintrin.h work? (I mean not the asm for it, but the action, and how this action is carried out in NUMA systems. Store monitoring is easy to implement only on bus-based SMP systems with bus snooping.)
Which processors implement it?
Is it used in some spinlocks?
...
This is specifically related to ARM Neon SIMD coding. I am using ARM Neon intrinsics for a certain module in a video decoder. I have vectorized data as follows:
There are four 32-bit elements in a Neon register, say Q0, which is 128 bits in size.
3B 3A 1B 1A
There are another four 32-bit elements in another Neon register, say Q1 ...
Is there any Intel AVX intrinsics library out there? I'm looking for something similar to the 'sse2mmx.h' header, which falls back to MMX intrinsics if SSE2 integer intrinsics are not available at compile time. Thus if I had a similar library for AVX I could write optimized code for new hardware which would have almost optimal speed in case AVX exte...
Hello,
I'm writing a transpose function for 8x16bit vectors with SSE2 intrinsics. Since there are 8 arguments for that function (a matrix of 8x8x16bit size), I can't do anything but pass them by reference. Will that be optimized by the compiler (I mean, will these __m128i objects be passed in registers instead of on the stack)?
Code snippet:
i...
Hello,
Is there any difference between the logical SSE intrinsics for different types? For example, if we take the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions:
Is there any difference between using one or another intrinsic (wi...
The ARM reference manual doesn't go into much detail on the individual instructions ( http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/BABIIBBG.html ). Is there something that's a little more detailed?
...
I have an inline assembler loop that cumulatively adds elements from an int32 data array with MMX instructions. In particular, it uses the fact that the eight MMX registers can together accommodate 16 int32s to calculate 16 different cumulative sums in parallel.
I would now like to convert this piece of code to MMX intrinsics but I am afraid that I wi...