intrinsics

How to optimize a cycle?

I have the following bottleneck function. typedef unsigned char byte; void CompareArrays(const byte * p1Start, const byte * p1End, const byte * p2, byte * p3) { const byte b1 = 128-30; const byte b2 = 128+30; for (const byte * p1 = p1Start; p1 != p1End; ++p1, ++p2, ++p3) { *p3 = (*p1 < *p2 ) ? b1 : b2; } } ...

Leading zeros calculation with intrinsic function

I'm trying to optimize some code working in an embedded system (FLAC decoding, Windows CE, ARM 926 MCU). The default implementation uses a macro and a lookup table: /* counts the # of zero MSBs in a word */ #define COUNT_ZERO_MSBS(word) ( \ (word) <= 0xffff ? \ ( (word) <= 0xff? byte_to_unary_table[word] + 24 : \ byte_...