views:

39

answers:

1

I'm trying to optimize some code working in an embedded system (FLAC decoding, Windows CE, ARM 926 MCU).

The default implementation uses a macro and a lookup table:

/* counts the # of zero MSBs in a word */
#define COUNT_ZERO_MSBS(word) ( \
 (word) <= 0xffff ? \
  ( (word) <= 0xff? byte_to_unary_table[word] + 24 : \
              byte_to_unary_table[(word) >> 8] + 16 ) : \
  ( (word) <= 0xffffff? byte_to_unary_table[word >> 16] + 8 : \
              byte_to_unary_table[(word) >> 24] ) \
)

static const unsigned char byte_to_unary_table[] = {
    8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};

However most CPU already have a dedicated instruction, bsr on x86 and clz on ARM (http://www.devmaster.net/articles/fixed-point-optimizations/), that should be more efficient.

On Windows CE we have the intrinsic function _CountLeadingZeros, that should just call that value. However it is 4 times slower than the macro (measured on 10 million of iterations).

How is possible that an intrinsic function, that (should) rely on a dedicated ASM instruction, is 4 times slower?

+5  A: 

Check the disassembly. Are you sure that the compiler inserted the instruction? In the Remarks section there is this text:

This function can be implemented by calling a runtime function.

I suspect that's what's happening in your case.

Note that the CLZ instruction is only available in ARMv5 and later. You need to tell the compiler if you want ARMv5 code:

/QRarch5 ARM5 Architecture
/QRarch5T ARM5T Architecture

(Microsoft incorrectly uses "ARM5" instead of "ARMv5")

Igor Skochinsky
I'll check the generated code. And if it is really a function call (instead of a true intrinsic) the call overhead would explain why it is so slow.
Lorenzo