views: 375
answers: 8

I am currently writing some GLSL-like vector math classes in C++, and I just implemented an abs() function like this:

    template<class T>
    static inline T abs(T _a)
    {
        return _a < 0 ? -_a : _a;
    }

I compared its speed to the default C++ abs from math.h like this:

    clock_t begin = clock();
    for(int i=0; i<10000000; ++i)
    {
        float a = abs(-1.25);
    };

    clock_t end = clock();
    unsigned long time1 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));

    begin = clock();
    for(int i=0; i<10000000; ++i)
    {
        float a  = myMath::abs(-1.25);
    };
    end = clock();
    unsigned long time2 = (unsigned long)((float)(end-begin) / ((float)CLOCKS_PER_SEC/1000.0));

    std::cout<<time1<<std::endl;
    std::cout<<time2<<std::endl;

Now the default abs takes about 25 ms while mine takes 60 ms. I guess there is some low-level optimisation going on. Does anybody know how the math.h abs works internally? The performance difference is nothing dramatic, but I am just curious!

+1  A: 

Probably the library version of abs is an intrinsic function whose behavior is known exactly to the compiler, which can even compute the value at compile time (since in your case the argument is a known constant) and optimize the call away. You should retry your benchmark with a value known only at runtime (provided by the user, or obtained with rand() before the two loops).
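
For instance, one of the two loops could be adapted along these lines (the rand()-based input and the volatile sink are additions of mine, so the compiler can neither precompute the result nor discard the loop):

    #include <cstdlib>   // rand, srand
    #include <ctime>     // clock, time, CLOCKS_PER_SEC
    #include <cmath>     // fabs
    #include <iostream>

    int main()
    {
        std::srand(static_cast<unsigned>(std::time(0)));
        float input = -1.25f * (std::rand() % 100 + 1);   // value unknown until runtime

        volatile float sink = 0.0f;                       // keeps the loop from being optimized away
        std::clock_t begin = std::clock();
        for (int i = 0; i < 10000000; ++i)
            sink = std::fabs(input);
        std::clock_t end = std::clock();

        std::cout << (end - begin) * 1000.0 / CLOCKS_PER_SEC << " ms" << std::endl;
        return 0;
    }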

If there's still a difference, it may be because the library abs is written directly in hand-tuned assembly with platform-specific tricks, so it could be a little faster than the code generated for your version.

Matteo Italia
+3  A: 

It probably just uses a bitmask to set the sign bit to 0.
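
A rough sketch of that idea for a 32-bit IEEE 754 float (illustrative only: it assumes float and unsigned int are both 32 bits wide, uses memcpy to move the bit pattern around, and the helper name is made up):

    #include <cstring>   // std::memcpy

    inline float abs_by_mask(float x)
    {
        unsigned int bits;
        std::memcpy(&bits, &x, sizeof bits);   // copy the bit pattern out of the float
        bits &= 0x7FFFFFFFu;                   // clear the sign bit (bit 31)
        std::memcpy(&x, &bits, sizeof x);      // copy the masked pattern back
        return x;
    }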

sepp2k
Hey, I am not very experienced with bitwise programming; how would that work?
moka
@moka: See http://en.wikipedia.org/wiki/Double_precision_floating-point_format.
Andreas Rejbrand
+7  A: 

Since they are the implementation, they are free to make as many assumptions as they want. They know the format of the double and can play tricks with that instead.

Likely (as in almost not even a question), your double is the IEEE 754 binary64 format. This means the sign has its own bit, and taking an absolute value is merely a matter of clearing that bit. For example, as a specialization, a compiler implementor may do:

template <>
inline double abs<double>(double _a)
{
    // uint64_t could be anything guaranteed to be 64-bit
    // note: boost provides boost::uint64_t

    uint64_t& i = reinterpret_cast<uint64_t&>(_a);
    i &= 0x7FFFFFFFFFFFFFFF; // clear sign bit

    return reinterpret_cast<double&>(i);
}

This removes branching and may run faster.

GMan
Wouldn't that be a blatant example of undefined behaviour?
jpalecek
@jpalecek: I don't see any UB in there. `reinterpret_cast` has implementation-defined behavior. (Hence my point of, "they can make whatever implementation-specific assumptions they want, since they are the implementation.") Non-library coders cannot. Once you write this code, you are assuming certain things about your implementation.
GMan
If the double is not IEEE-754, yes. It is, at the very least, less portable.
Yann Ramin
Couldn't you just return `_a`?
FredOverflow
The point is that the implementation can write this code, but you can't.
rlbond
@Fred: I don't see what you mean.
GMan
@GMan: I'm not talking about `reinterpret_cast`; what I had in mind is 3.10/15 (strict aliasing violation): If a program attempts to access the stored value of an object through an lvalue of other than one of the following types, the behavior is undefined: (here comes an enumeration of cases). Accessing a `double` through an lvalue of type `uint64_t` is not mentioned in the enumeration.
jpalecek
@jpalecek: It may be undefined to you, but to the implementers of the library it is well defined (as the maths library is tightly coupled to the compiler). They know what the implementation is doing and as such can perform certain optimizations based on this knowledge. You, on the other hand, cannot make any assumptions, as your code is not coupled to the compiler.
Martin York
@jpalecek: As @Martin points out, there's a reason my entire answer starts with "Since they are the implementation, they are free to make as many assumptions as they want."
GMan
@GMan: Maybe they can make assumptions about the behaviour of the compiler beyond the C standard (which may or may not be wise), but you haven't made enough assumptions for your code to work.
jpalecek
@jpalecek: I edited my answer, is it clearer now? A compiler vendor can assume whatever they want because they are the compiler. When you compile for a specific platform, you *know* what assumptions you can and cannot make.
GMan
@GMan: No. This question is tagged c++, and if you want to make any assumptions beyond the standard, you have to spell them out, _specifically_. With your "the compiler writer may assume anything" doctrine, you may as well say that `d/0` returns the absolute value of `d`. I hope you agree this would be just BS, not a useful solution.
jpalecek
@jpalecek: I don't see how you fail to see the point of the question and answer. The question is "why is this implementation faster?" I'm answering with "because they can make whatever assumptions they want, here's an example." It doesn't get much simpler than that. Sure, it's tagged C++ because the question is about C++. But it's also about how C++ is implemented, compiler-wise. And yes, if `d/0` returns the absolute value on some platform, the compiler vendor can and should take advantage of that. I really don't get how hard it is to wrap your head around "implementors can assume whatever they want".
GMan
@jpalecek: Actually, I think it is you who seems to be missing the point.
Martin York
+2  A: 

There can be several things:

  • are you sure the first call uses std::abs? It could just as well use the integer abs from C (either call std::abs explicitly, or add using std::abs;). A quick check is sketched after this list.

  • the compiler might have an intrinsic implementation of some float functions (e.g. compiling them directly into FPU instructions)

However, I'm surprised the compiler doesn't eliminate the loop altogether, since you don't do anything with any observable effect inside it, and at least in the case of abs the compiler should know there are no side effects.
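
A quick check of the first point (a minimal sketch; exactly which overload the unqualified call picks depends on your headers and compiler, which is the issue):

    #include <cstdlib>   // C abs(int)
    #include <cmath>     // std::abs overloads for floating point
    #include <iostream>

    int main()
    {
        std::cout << abs(-1.25) << std::endl;      // if this prints 1, the integer abs was chosen
        std::cout << std::abs(-1.25) << std::endl; // double overload: prints 1.25
        return 0;
    }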

jpalecek
+4  A: 

There are well-known tricks for computing the absolute value of a two's complement signed number. If the number is negative, flip all the bits and add 1, that is, xor with -1 and subtract -1. If it is positive, do nothing, that is, xor with 0 and subtract 0.

int my_abs(int x)
{
    int s = x >> 31;     // arithmetic shift: s is -1 if x is negative, 0 otherwise (assumes 32-bit int)
    return (x ^ s) - s;  // negative: flip all bits (xor with -1) and add 1; positive: unchanged
}
FredOverflow
floating point values are not stored in two's complement, but this is a clever trick for absolute value. (For determining sign, I do the same first step but then return `s | 1`; -1 = negative, +1 = positive)
dash-tom-bang
This is good, albeit off-topic here...
jpalecek
@jpal Fair enough, but where exactly does moka state he is only interested in floating point types? Maybe `template<class T>` should be replaced with `template<class Floating>` then, just to be clear on this?
FredOverflow
+1  A: 

Your version of abs is inlined and called with a constant, so its result can be computed once; the compiler can trivially see that the returned value isn't going to change, so it doesn't even need to call the function.

You really need to look at the generated assembly code (set a breakpoint and open the "large" debugger view; the disassembly will be at the bottom left, if memory serves), and then you can see what's going on.

You can find documentation on your processor online without too much trouble; it'll tell you what all of the instructions do, so you can figure out what's happening. Alternatively, paste it here and we'll tell you. ;)

dash-tom-bang
+4  A: 

What compiler and settings are you using? I'm sure MSVC and GCC implement "intrinsic functions" for many math and string operations.

The following line:

printf("%.3f", abs(1.25));

falls into the following "fabs" code path (in msvcr90d.dll):

004113DE  sub         esp,8 
004113E1  fld         qword ptr [__real@3ff4000000000000 (415748h)] 
004113E7  fstp        qword ptr [esp] 
004113EA  call        abs (4110FFh) 

abs calls the C runtime 'fabs' implementation in MSVCR90D (rather large):

102F5730  mov         edi,edi 
102F5732  push        ebp  
102F5733  mov         ebp,esp 
102F5735  sub         esp,14h 
102F5738  fldz             
102F573A  fstp        qword ptr [result] 
102F573D  push        0FFFFh 
102F5742  push        133Fh 
102F5747  call        _ctrlfp (102F6140h) 
102F574C  add         esp,8 
102F574F  mov         dword ptr [savedcw],eax 
102F5752  movzx       eax,word ptr [ebp+0Eh] 
102F5756  and         eax,7FF0h 
102F575B  cmp         eax,7FF0h 
102F5760  jne         fabs+0D2h (102F5802h) 
102F5766  sub         esp,8 
102F5769  fld         qword ptr [x] 
102F576C  fstp        qword ptr [esp] 
102F576F  call        _sptype (102F9710h) 
102F5774  add         esp,8 
102F5777  mov         dword ptr [ebp-14h],eax 
102F577A  cmp         dword ptr [ebp-14h],1 
102F577E  je          fabs+5Eh (102F578Eh) 
102F5780  cmp         dword ptr [ebp-14h],2 
102F5784  je          fabs+77h (102F57A7h) 
102F5786  cmp         dword ptr [ebp-14h],3 
102F578A  je          fabs+8Fh (102F57BFh) 
102F578C  jmp         fabs+0A8h (102F57D8h) 
102F578E  push        0FFFFh 
102F5793  mov         ecx,dword ptr [savedcw] 
102F5796  push        ecx  
102F5797  call        _ctrlfp (102F6140h) 
102F579C  add         esp,8 
102F579F  fld         qword ptr [x] 
102F57A2  jmp         fabs+0F8h (102F5828h) 
102F57A7  push        0FFFFh 
102F57AC  mov         edx,dword ptr [savedcw] 
102F57AF  push        edx  
102F57B0  call        _ctrlfp (102F6140h) 
102F57B5  add         esp,8 
102F57B8  fld         qword ptr [x] 
102F57BB  fchs             
102F57BD  jmp         fabs+0F8h (102F5828h) 
102F57BF  mov         eax,dword ptr [savedcw] 
102F57C2  push        eax  
102F57C3  sub         esp,8 
102F57C6  fld         qword ptr [x] 
102F57C9  fstp        qword ptr [esp] 
102F57CC  push        15h  
102F57CE  call        _handle_qnan1 (102F98C0h) 
102F57D3  add         esp,10h 
102F57D6  jmp         fabs+0F8h (102F5828h) 
102F57D8  mov         ecx,dword ptr [savedcw] 
102F57DB  push        ecx  
102F57DC  fld         qword ptr [x] 
102F57DF  fadd        qword ptr [__real@3ff0000000000000 (1022CF68h)] 
102F57E5  sub         esp,8 
102F57E8  fstp        qword ptr [esp] 
102F57EB  sub         esp,8 
102F57EE  fld         qword ptr [x] 
102F57F1  fstp        qword ptr [esp] 
102F57F4  push        15h  
102F57F6  push        8    
102F57F8  call        _except1 (102F99B0h) 
102F57FD  add         esp,1Ch 
102F5800  jmp         fabs+0F8h (102F5828h) 
102F5802  mov         edx,dword ptr [ebp+0Ch] 
102F5805  and         edx,7FFFFFFFh 
102F580B  mov         dword ptr [ebp-0Ch],edx 
102F580E  mov         eax,dword ptr [x] 
102F5811  mov         dword ptr [result],eax 
102F5814  push        0FFFFh 
102F5819  mov         ecx,dword ptr [savedcw] 
102F581C  push        ecx  
102F581D  call        _ctrlfp (102F6140h) 
102F5822  add         esp,8 
102F5825  fld         qword ptr [result] 
102F5828  mov         esp,ebp 
102F582A  pop         ebp  
102F582B  ret   

In release mode, the FPU FABS instruction is used instead (it takes only one clock cycle on Pentium and later FPUs); the disassembly output is:

00401006  fld         qword ptr [__real@3ff4000000000000 (402100h)] 
0040100C  sub         esp,8 
0040100F  fabs             
00401011  fstp        qword ptr [esp] 
00401014  push        offset string "%.3f" (4020F4h) 
00401019  call        dword ptr [__imp__printf (4020A0h)] 
Hernán
A: 

The library abs function operates on integers, while you are obviously testing floats. This means that a call to abs with a float argument involves a conversion from float to int (which may be a no-op, since you are using a constant and the compiler can do it at compile time), then an INTEGER abs operation, and a conversion back from int to float. Your templated function operates on floats throughout, and this is probably what makes the difference.
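
Written out with the conversions made explicit, that path would look roughly like this (whether the unqualified call in the benchmark really takes this route depends on which overload is selected, as noted in other answers):

    #include <cstdlib>   // std::abs(int)
    #include <iostream>

    int main()
    {
        float x = -1.25f;
        int   i = static_cast<int>(x);              // float -> int: fraction discarded, i == -1
        float r = static_cast<float>(std::abs(i));  // integer abs, then int -> float
        std::cout << r << std::endl;                // prints 1, not 1.25
        return 0;
    }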

Tomek