Is there any way to optimize the following line of C code (to avoid branching)?
if ((i < -threshold) || (i > threshold))
{
counter++;
}
All variables are 16-bit signed integers. An optimized version should be highly portable.
Is there any way to optimize the following line of C code (to avoid branching)?
if ((i < -threshold) || (i > threshold))
{
counter++;
}
All variables are 16-bit signed integers. An optimized version should be highly portable.
How about:
counter += (i < -threshold) | (i > threshold);
Assuming the original code was valid, then this should work too, in a portable way. The standard says that relational operators (<
, >
and so on) return an int
equal to 1
on success, or 0
on failure.
UPDATE
To answer Sheen's comment below, the following code:
int main()
{
short threshold = 10;
short i = 20;
short counter = 0;
counter += (i < -threshold) | (i > threshold);
return 0;
}
results in the following disassembler on x86 using GCC, with no optimisations:
push %rbp
mov %rsp,%rbp
movw $0xa,-6(%rbp)
movw $0x14,-4(%rbp)
movw $0x0,-2(%rbp)
movswl -4(%rbp),%edx
movswl -6(%rbp),%eax
neg %eax
cmp %eax,%edx
setl %dl
movzwl -4(%rbp),%eax
cmp -6(%rbp),%ax
setg %al
or %edx,%eax
movzbw %al,%dx
movzwl -2(%rbp),%eax
lea (%rdx,%rax,1),%eax
mov %ax,-2(%rbp)
mov $0x0,%eax
leaveq
retq
Compare the absolute of both
short imask = i >> sizeof(short) * 8 - 1; //compute the sign bit 1 or 0
short tmask = threshold >> sizeof(short) * 8 - 1; //compute the sign bit 1 or 0
short iabsolute = (i + imask) ^ imask; // compute i absolute
short tabsolute = (threshold + tmask) ^ tmask; // compute threshold absolute
counter += iabsolute > tabsolute;
Depending on the distribution of values of 'i', your CPU may well cache the branch prediction for you better than any code change you might make. See http://igoro.com/archive/fast-and-slow-if-statements-branch-prediction-in-modern-processors/ for an interesting writeup. Reddit discussion here: http://www.reddit.com/r/programming/comments/c7ues/fast_and_slow_ifstatements_branch_prediction_in/
You can use the following trick which reduces the branches to a single branch:
if (((unsigned) (i + threshold)) > (threshold << 1))
{
counter++;
}
or, for the pedantic:
if (((unsigned) i + (unsigned) threshold) > ((unsigned) threshold << 1))
{
counter++;
}
This is based on bit twiddling hacks, (highly recommended)
#define CHAR_BIT 8
int main()
{
int i=-3; // example input
int treshold=2; // example treshold
int count=0;
// step 1: find the absolute value of i
unsigned int r; // the result goes here
int const mask = i >> (sizeof(int) * CHAR_BIT - 1);
r = (i + mask) ^ mask;
// step 2: compute the sign of the difference
// sign becomes 0 (if r<=treshold)
// sign becomes 1 otherwise
int sign = 1 ^ ((unsigned int)(r-treshold-1) >> (sizeof(int) * CHAR_BIT - 1));
count+=sign;
return count;
}
This works for 32 bit integers, adapting to 16 bits should be easy. It compiles using g++.
The speed depends on the used processor. Branching might be faster after all.
Oli Charlesworth, I think, has the right idea. However, I suspect that it can be further optimized (at the expense of readability).
The threshold can be normalized to zero to remove a comparison.
That is, ...
counter += ((unsigned) (i + threshhold) < (unsigned) (threshhold + threshhold));
There is a standard idiom for range-checking with a single comparison instruction. It goes like:
(unsigned)x - a <= (unsigned)b - a /* a <= x <= b */
(unsigned)x - a < (unsigned)b - a /* a <= x < b */
As a common example (this version if isdigit
is guaranteed to be correct by the standard):
(unsigned)ch - '0' < 10
If your original type is larger than int
(for instance long long
) then you will need to use larger unsigned types (for instance unsigned long long
). If a
and b
are constants or already have unsigned type, or if you know b-a
will not overflow, you can omit the cast from b
.
In order for this method to work, naturally you must have a<=b
and the types/values must be such that the original expression (i.e. a <= x && x <= b
or similar) behaves mathematically correctly. For instance if x
were signed and b
unsigned, x<=b
could evaluate to false when x=-1
and b=UINT_MAX-1
. As long as your original types are all signed or smaller than the unsigned type you cast to, this is not an issue.
As for how this "trick" works, it is purely determining, after reduction modulo UINT_MAX+1
, whether x-a
lies in the range 0 to b-a
.
In your case, I think the following should work just fine:
(unsigned)i + threshold > 2U * threshold;
If threshold
does not change between loop iterations, the compiler can probably keep both threshold
and 2U*threshold
in registers.
Speaking of optimizations, a good compiler should optimize your original range test to use unsigned arithmetic where it knows the constraints are met. I suspect many do so with a
and b
constant, but perhaps not with more complex expressions. Even if the compiler can optimize it, though, the (unsigned)x-a<b-a
idiom is still extremely useful in macros where you want to ensure that x
is evaluated exactly once.
Oh, too bad the question has already been answered. To paraphrase Oli's answer, the code
#include <stdint.h>
int main()
{
int32_t threshold_square = 100;
int16_t i = 20;
int16_t counter = 0;
counter += ( (int32_t) i * i > threshold_square);
return 0;
}
yields the following x86 assembler using GCC without optimizations
pushq %rbp
movq %rsp, %rbp
movl $100, -8(%rbp)
movw $20, -2(%rbp)
movw $0, -4(%rbp)
movswl -2(%rbp),%edx
movswl -2(%rbp),%eax
imull %edx, %eax
cmpl -8(%rbp), %eax
setg %al
movzbl %al, %edx
movzwl -4(%rbp), %eax
leal (%rdx,%rax), %eax
movw %ax, -4(%rbp)
movl $0, %eax
leave
ret
which is four instructions less than using (i < -threshold) | (i > threshold)
.
Whether this is better or not is, of course, depending on the architecture.
(The use of stdint.h is for illustrative purposes, for strict C89 replace with whatever is relevant for the target system.)
What is wrong with the original code? Does it really need hand-optimising?
Any decent compiler should be able to optimise that very well. Any hand-optimising would probably only lead to obfuscation.
This code have no branch an highly portable (however, implementation of abs may have one).
#include <stdlib.h>
counter += abs(i) > threshold;
That's simplest standard compliant expression.
If your compiler does not using optimized macro for abs() you may use your own macro/ inline function.
That are examples, that use nature of twos complement format used on most machines:
#define ABS(x) ((x)*(((x)>>15)|1))
#define ABS(x) ((x)-((x)>>15)^((x)>>15))
Also you may replace comparison operator with expression like this:
#define LESS(x, y) (-((x)-(y))>>15))
Resulting code:
counter -= ((threshold - abs(i)) >> 15);
All those macros rely on fact, that shift right to number of bits minus one of positive value or zero evaluates to zero, and of negative evaluates to minus one. But thats implementation defined.