views:

259

answers:

8
float mixValue = ... //in range -1.0f to 1.0f
for(... ; ... ; ...  ) //long loop
{
    float inputLevel = ... //in range -1.0f to 1.0f
    if(inputLevel < 0.0 && mixValue < 0.0)
    {
        mixValue = (mixValue + inputLevel) + (mixValue*inputLevel);
    }
    else
    {
        mixValue = (mixValue + inputLevel) - (mixValue*inputLevel);
    }
}

just a simple question, can we calculate mixValue without branching? or any other optimization suggestion, such as using SIMD?

edit: just for more information, I ended up using this solution, based on choosen answer:

const float sign[] = {-1, 1};
float mixValue = ... //in range -1.0f to 1.0f
for(... ; ... ; ...  ) //long loop
{
    float inputLevel = ... //in range -1.0f to 1.0f
    unsigned a = *(unsigned*)(&mixValue);
    unsigned b = *(unsigned*)(&inputLevel);

    float mulValue = mixValue * inputLevel * sign[(a & b) >> (8*sizeof(unsigned)-1)];
    float addValue = mixValue + inputLevel;
    mixValue = addValue + mulValue;
}

thank you.

A: 

Have you benchmarked the loop with and without the branch ?

At least you could remove one part of the branch, since mixValue is outside of the loop.

float multiplier(float a, float b){
  unsigned char c1Neg = reinterpret_cast<unsigned char *>(&a)[3] & 0x80;
  unsigned char c2Neg = reinterpret_cast<unsigned char *>(&b)[3] & 0x80;
  unsigned char multiplierIsNeg = c1Neg & c2Neg;
  float one = 1;
  reinterpret_cast<unsigned char *>(&one)[3] |= multiplierIsNeg;
  return -one;
}
cout << multiplier(-1,-1) << endl; // +1
cout << multiplier( 1,-1) << endl; // -1
cout << multiplier( 1, 1) << endl; // -1
cout << multiplier(-1, 1) << endl; // -1
tibur
my question is to eliminate branch, the benchmark result is out of this question.
uray
`mixValue` is loop dependent variable see: `mixValue = (mixValue + ...`
uray
you're right. Sorry.
tibur
That should do the trick with only four bitwise operations.
tibur
A: 

If you are worried about excessive branching, look at Duff's Device. This should help unwind the loop somewhat. Truth be told, loop unwinding is something that will be done by the optimizer, so trying to do it by hand may be a waste of time. Check the assembly output to find out.

SIMD will definitely be of assistance provided you a performing the exact same operation to each item in your array. Be aware than not all hardware supports SIMD but some compilers like gcc do provide intrinsics for SIMD which will save your from dipping into assembler.

If you are using gcc to compile ARM code, the SIMD intrinsics can be found here

doron
I've seen the asm output, it does unwind the loop, but branching is still there, it create two code path, and no SIMD applied although it used instruction like mulss or addss on xmm reg
uray
The problem with your code is that `mixValue` is changed between iterations, so I guess no SIMD is possible here.
jpalecek
A: 

Just off the top of my head (I'm sure it can be reduced):

mixValue = (mixValue + inputLevel) + (((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1) / fabs(((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1)))*-1*(mixValue*inputLevel);

Just to clarify a bit, I'll calculate sign separately:

float sign = (((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1) / fabs(((mixValue / fabs(mixValue)) + (inputLevel / fabs(inputLevel))+1)))*-1;
mixValue = (mixValue + inputLevel) + sign*(mixValue*inputLevel);

This is floating point math, so you'll likely need to correct for some rounding issues, but that should set you on the right path I think.

Anthony DiSanti
I bet division is even more inefficient than branching.
KennyTM
@NullUserException: `fabs()` can be computed without branching.
jpalecek
+4  A: 

How about this:

const float sign[] = {-1, 1};

float mixValue = ... //in range -1.0f to 1.0f
for(... ; ... ; ...  ) //long loop
{
    float inputLevel = ... //in range -1.0f to 1.0f
    int bothNegative = (inputLevel < 0.0) & (mixValue < 0.0);
    mixValue = (mixValue + inputLevel) + (sign[bothNegative]*mixValue*inputLevel);
}

Edit: Mike was correct that && would introduce a branch and thanks for Pedro for proving it. I changed && to & and now GCC (version 4.4.0) generates branch-free code.

Roku
The problem is : if bothNefative is false, it's equal to 0, so it can never been negative.
Klaim
@Klaim: `sign[0]` is -1, so `sign[bothNegative]` with `bothNegative==0` is -1
uray
Ah yes I did'nt see the array. That's smart! I spent too much time trying to get something like that and I never thought of predefined values in an array XDSo simple...
Klaim
`
Mike Seymour
Pedro d'Aquino
the branch for the condition is expected still.
peterchen
+1  A: 
float mixValue = ... //in range -1.0f to 1.0f
for(... ; ... ; ...  ) //long loop
{
     float inputLevel = ... //in range -1.0f to 1.0f
     float mulValue = mixValue * inputLevel;
     float addValue = mixValue + inputLevel;
     __int32 a = *(__int32*)(&mixValue);
     __int32 b = *(__int32*)(&inputLevel);
     __int32 c = *(__int32*)(&mulValue);
     __int32 d = c & ((a ^ b) | 0x7FFFFFFF);
     mixValue = addValue + *(float*)(&d);
}
uray
+2  A: 

Inspired by Roku's answer (which on MSVC++10 branches), this doesn't seem to branch:

#include <iostream>

using namespace std;
const float sign[] = {-1, 1};
int main() {
    const int N = 10;
    float mixValue = -0.5F;
    for(int i = 0; i < N; i++) {
        volatile float inputLevel = -0.3F;
        int bothNegative = ((((unsigned char*)&inputLevel)[3] & 0x80) & (((unsigned char*)&mixValue)[3] & 0x80)) >> 7;
        mixValue = (mixValue + inputLevel) + (sign[bothNegative]*mixValue*inputLevel);
    }

    std::cout << mixValue << std::endl;
}

Here's the disassembly, as analyzed by IDA Pro (compiled on MSVC++10, Release mode):

Disassembly

Pedro d'Aquino
why u need `volatile`?
uray
Just to make sure the compiler doesn't optimize it away.
Pedro d'Aquino
disclaimer: like most bit twiddling code, it relies on the representation in memory of built-in types (here float) and cannot be assumed to be portable (32 / 64 bits, etc...)
Matthieu M.
A: 

Looking at your code, you see that you will always add the absolute value of mixValue and inputLevel, except when both are positive.

With some bit-fiddling and IEEE floatingpoint knowledge, you may get rid of the conditional:

// sets the first bit of f to zero => makes it positive.
void absf( float& f ) {
   assert( sizeof( float ) == sizeof( int ) );
   reinterpret_cast<int&>( f ) &= ~0x80000000;
}

// returns a first-bit = 1 if f is positive
int pos( float& f ) {
  return ~(reinterpret_cast<int&>(f) & 0x80000000) & 0x80000000;
}

// returns -fabs( f*g ) if f>0 and g>0, fabs(f*g) otherwise.    
float prod( float& f, float& g ) {
  float p = f*g;
  float& rp=p;
  int& ri = reinterpret_cast<int&>(rp);
  absf(p);
  ri |= ( pos(f) & pos(g) & 0x80000000); // first bit = + & +
  return p;
}

int main(){
 struct T { float f, g, r; 
    void test() {
       float p = prod(f,g);
       float d = (p-r)/r;
       assert( -1e-15 < d && d < 1e-15 );
    }
 };
 T vals[] = { {1,1,-1},{1,-1,1},{-1,1,1},{-1,-1,1} };
 for( T* val=vals; val != vals+4; ++val ) {
    val->test();
 }
}

And finally: your loop

for( ... ) {
    mixedResult += inputLevel + prod(mixedResult,inputLevel);
}

Note: the dimensions of your accumulation don't match. The inputLevel is a dimensionless quantity, while mixedResult is your... result (e.g. in Pascal, in Volts, ...). You cannot add two quantities with different dimensions. Probably you want mixedResult += prod( mixedResult, inputLevel ) as your accumulator.

xtofl
A: 

Some compilers (ie MSC) would also require manual sign checking.

Source:

volatile float mixValue;
volatile float inputLevel;

float u   = mixValue*inputLevel;
float v   = -u;
float a[] = { v, u };

mixValue = (mixValue + inputLevel) + a[ (inputLevel<0.0) & (mixValue<0.0) ];

IntelC 11.1:

movss     xmm1, DWORD PTR [12+esp]    
mulss     xmm1, DWORD PTR [16+esp]    
movss     xmm6, DWORD PTR [12+esp]    
movss     xmm2, DWORD PTR [16+esp]    
movss     xmm3, DWORD PTR [16+esp]    
movss     xmm5, DWORD PTR [12+esp]    
xorps     xmm4, xmm4                  
movaps    xmm0, xmm4                  
subss     xmm0, xmm1                  
movss     DWORD PTR [esp], xmm0       
movss     DWORD PTR [4+esp], xmm1     
addss     xmm6, xmm2                  
xor       eax, eax                    
cmpltss   xmm3, xmm4                  
movd      ecx, xmm3                   
neg       ecx                         
cmpltss   xmm5, xmm4                  
movd      edx, xmm5                   
neg       edx                         
and       ecx, edx                    
addss     xmm6, DWORD PTR [esp+ecx*4] 
movss     DWORD PTR [12+esp], xmm6    

gcc 4.5:

flds    32(%esp)
flds    16(%esp)
fmulp   %st, %st(1)
fld     %st(0)
fchs
fstps   (%esp)
fstps   4(%esp)
flds    32(%esp)
flds    16(%esp)
flds    16(%esp)
flds    32(%esp)
fxch    %st(2)
faddp   %st, %st(3)
fldz
fcomi   %st(2), %st
fstp    %st(2)
fxch    %st(1)
seta    %dl
xorl    %eax, %eax
fcomip  %st(1), %st
fstp    %st(0)
seta    %al
andl    %edx, %eax
fadds   (%esp,%eax,4)
xorl    %eax, %eax
fstps   32(%esp)
Shelwien