I was reading some old game programming books and, as some of you might know, back in the day it was usually faster to do bit hacks than to do things the standard way (converting a float to int, masking the sign bit, and converting back, instead of just calling fabs(), for example).

Nowadays it is almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.

But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:

void float_to_int(float f)
{
    int i = static_cast<int>(f); // has no side-effects
}

Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.

The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.

I'm using Visual Studio 2008 / G++ (3.4.4).

Edit

To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that, with that, statements with no side effects will be optimized out, hence the situation.

Edit Again

To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.

We all know that the old tricks aren't very useful anymore; I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern-day CPUs, but it never hurts to know.

So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.

Premature quoting of Knuth is the root of all annoyance.

+2  A: 

A function call incurs quite a bit of overhead, so I would remove it anyway.

Adding a dummy += i; is no problem, as long as you keep that same bit of code in the alternate profile too (i.e. the code you are comparing it against).

Last but not least: generate the assembly output (e.g. -S for g++, /FA for Visual C++). Even if you cannot write assembly, the generated code is typically understandable, since it will have labels and the C code in comments alongside it. That way you know (more or less) what happens, and which bits are kept.

R

p.s. found this too:

 inline float pslNegFabs32f(float x)
 {
     __asm {
         fld  x   // push 'x' onto st(0) of the FPU stack
         fabs     // st(0) = |x|
         fchs     // change sign: st(0) = -|x|
         fstp x   // pop st(0) back into x
     }
     return x;
 }

supposedly also very fast. You might want to profile this too. (although it is hardly portable code)

Toad
Well, generating and reading assembly is fine by me, but...there will be no assembly generated.
GMan
My point is: by creating asm output (a compiler option), you can see the actual asm code and this way be sure whether any statements are indeed optimized away.
Toad
Ah, I read that wrong, I thought you meant to check the optimizations.
GMan
inline float pslNegFabs32f(float x){ __asm{ fld x //Push 'x' into st(0) of FPU stack fabs fchs //change sign fstp x //Pop from st(0) of FPU stack } return x; } bumped into this one.... maybe even faster ;^)
Toad
ah...sorry that the indentation goes berzerk in the comments section...I'll add it to my original answer
Toad
Ha, that's ok. Use the `inline code tags`, they work in comments now :) Though that won't do any justice to tabs. I'll comment on the results of `fabs()` versus `mask_abs()` versus your hack soonish.
GMan
I think the fchs should be removed, by the way... It looks as if they do a fabs and then make it negative, i.e. they force the value to be negative (while you want the value to be positive). My x86 asm is too bad to know for sure. I used to program 68K asm. ;^)
Toad
Yea, I've done a bit of work with assembly. A professor made me write the Mandelbrot set loop by hand :| And yes, `fabs` will mask the sign, and `fchs` reverses it.
GMan
And for the record, the assembly version ran fastest, followed by the mask, followed by the standard fabs call. Keep in mind that when I say "fastest", this is hundredths of milliseconds difference on average, over a couple million iterations. It does add up, though. `fabs` took a bit over a minute longer than the assembly version to go through one million absolute value conversions. Interesting result, though; I'll look around for some compiler flags to try and get VC to generate the same assembly code.
GMan
+4  A: 

So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements

You are by definition skewing the results.

Here's how to fix the problem of profiling "dummy" code that you wrote just to test: save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get all the other optimizations it can apply to make the code fast.

280Z28
A: 

A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as any given statement in itself.

Will
Agreed, things like this rarely matter alone. I'm just seeing if there is still a positive difference between bit hacks and standard calls.
GMan
+3  A: 

Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, as long as the code behaves as if no optimisation takes place. However, you can often trick them into not doing so by indicating that the value might be used later, so I would change your code to:

int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}

As others have suggested, you will need to examine the assembler output to check that this approach actually works.

anon
I'm going to accept this one. All of these answers were helpful, but this was the first one to suggest returning the value which I eventually did.
GMan
It is very likely that this function will be inlined into the caller and then removed. You might need to define it in a different .cpp file and make sure that all the linker optimisations are turned off.
Tom Leys
+4  A: 

In this case I suggest you make the function return the integer value:

int float_to_int(float f)
{
   return static_cast<int>(f);
}

Your calling code can then exercise it with a printf, to guarantee the result is used and the call isn't optimized out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.

extern int float_to_int(float f);
int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
   sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);

Now compare this to an empty function like:

int take_float_return_int(float /* f */)
{
   return 1;
}

Which should also be external.

The difference in times should give you an idea of the expense of what you're trying to measure.

George Phillips
+2  A: 

You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.

These quite clearly state that casting between float and int is a really bad idea, because it requires a store from the FPU register to memory followed by a load into an integer register. These operations cause a bubble in the pipeline and waste many precious cycles.

Tom Leys
Harsh but fair :)
AakashM
Did I not say I'm not actually worried about this in production code, and was just curious? Or that there were other functions? This un-answer doesn't answer the question at all.
GMan
No offence intended GMan. Many other answers address your question directly. I am trying to address your thirst for knowledge. I digest such manuals whole - http://people.redhat.com/drepper/cpumemory.pdf - for example shows how memory access is far more important than CPU for most applications.
Tom Leys
A: 

GCC 4 does a lot of micro-optimizations now that GCC 3.4 never did. GCC 4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin() and fabs(), as well as optimizing such calls to their FPU, SSE, or 3DNow! equivalents.

I know the Intel compiler is also extremely good at these kinds of optimizations.

My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.

Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.

All of this is due to the fact that newer CPUs are pipelined: instructions are decoded and dispatched to fast computation units. Instruction cost is no longer a simple clock-cycle count, and performance is more sensitive to cache misses and inter-instruction dependencies.

Check the AMD and Intel processor programming manuals for all the gritty details.

greyfade
Good to know that I should upgrade to 4 :P And I tried to make it clear, but I guess not: I'm not worried at all. I'm usually the one telling others not to worry about tiny things like this. I'm just curious what differences there are now, on modern CPUs.
GMan
I'll edit my answer
greyfade
+2  A: 

Assignment to a volatile variable should never be optimized away, so this might give you the result you want:

static volatile int i = 0;

void float_to_int(float f)
{
    i = static_cast<int>(f); // has no side-effects
}
finnw
A volatile variable may be changed externally, so the compiler has to make sure it uses the value from the original memory location. Apart from that, the compiler is free to generate optimised code to convert a float to int (or for any other operation).
Indeera
+1  A: 

What has always worked on all compilers I've used so far:

extern volatile int writeMe = 0;

void float_to_int(float f)
{    
  writeMe = static_cast<int>(f); 
}

Note that this skews the results; both methods should write to writeMe.

volatile tells the compiler "the value may be accessed without your notice", so the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile, too:

extern volatile float readMe = 0;
extern volatile int writeMe = 0;

void float_to_int(float f)
{    
  writeMe = static_cast<int>(f); 
}

int main()
{
  readMe = 17;
  float_to_int(readMe);
}

Still, all optimizations in between the read and the write can be applied "with full force". The read and write to the global variable are often good "fence posts" when inspecting the generated assembly.

Without the extern, the compiler may notice that a reference to the variable is never taken, and thus determine it can't really be observed externally. Technically, with Link Time Code Generation, even extern might not be enough, but I haven't found a compiler that aggressive. (For a compiler that does remove the access, the reference would need to be passed to a function in a DLL loaded at runtime.)

peterchen
+1  A: 

Return the value?

int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}

and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.

You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.

jalf