I was reading some old game programming books and, as some of you might know, back in the day it was usually faster to do bit hacks than to do things the standard way (converting a float to int, masking the sign bit, and converting back, instead of just calling fabs(), for example).

Nowadays it is almost always better to just use the standard library math functions, since these tiny things are hardly the cause of most bottlenecks anyway.

But I still want to do a comparison, just for curiosity's sake. So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements that have no side effect, such as:

void float_to_int(float f)
{
    int i = static_cast<int>(f); // has no side-effects
}

Is there a way to do this? As far as I can tell, doing something like i += 10 will still have no side-effect and as such won't solve the problem.

The only thing I can think of is having a global variable, int dummy;, and after the cast doing something like dummy += i, so the value of i is used. But I feel like this dummy operation will get in the way of the results I want.

I'm using Visual Studio 2008 / G++ (3.4.4).

Edit

To clarify, I would like to have all optimizations maxed out, to get good profile results. The problem is that, with that, statements with no side effects will be optimized out, hence the situation.

Edit Again

To clarify once more, read this: I'm not trying to micro-optimize this in some sort of production code.

We all know that the old tricks aren't very useful anymore; I'm merely curious how not useful they are. Just plain curiosity. Sure, life could go on without me knowing just how these old hacks perform against modern-day CPUs, but it never hurts to know.

So telling me "these tricks aren't useful anymore, stop trying to micro-optimize blah blah" is an answer completely missing the point. I know they aren't useful, I don't use them.

Premature quoting of Knuth is the root of all annoyance.

+2  A: 

A function call incurs quite a bit of overhead, so I would remove it anyway.

Adding a dummy += i; is no problem, as long as you keep that same bit of code in the alternate profile too (i.e. the code you are comparing it against).

Last but not least: generate the assembly output (e.g. -S for g++, /FA for Visual C++). Even if you cannot write assembly, the generated code is typically understandable, since it will have labels and the C code in comments alongside it. That way you know (more or less) what happens, and which bits are kept.

R

p.s. found this too:

 inline float pslNegFabs32f(float x)
 {
     __asm {
         fld  x   // push 'x' onto st(0) of the FPU stack
         fabs     // st(0) = |x|
         fchs     // change sign: st(0) = -|x|
         fstp x   // pop st(0) back into x
     }
     return x;
 }

supposedly also very fast. You might want to profile this too. (although it is hardly portable code)

Toad
Well, generating and reading assembly is fine by me, but...there will be no assembly generated.
GMan
My point is: by creating asm output (a compiler option), you can see the actual asm code and this way be sure whether any statements are indeed optimized away.
Toad
Ah, I read that wrong, I thought you meant to check the optimizations.
GMan
inline float pslNegFabs32f(float x){ __asm{ fld x //Push 'x' into st(0) of FPU stack fabs fchs //change sign fstp x //Pop from st(0) of FPU stack } return x; } bumped into this one.... maybe even faster ;^)
Toad
ah...sorry that the indentation goes berzerk in the comments section...I'll add it to my original answer
Toad
Ha, that's ok. Use the `inline code tags`, they work in comments now :) Though that won't do any justice to tabs. I'll comment on the results of `fabs()` versus `mask_abs()` versus your hack soonish.
GMan
I think the fchs should be removed, by the way... It looks as if they do a fabs and then make it negative, i.e. they force the value to be negative (while you want the value to be positive). My x86 asm is too bad to know for sure. I used to program 68K asm. ;^)
Toad
Yea, I've done a bit of work with assembly. A professor made me write the Mandelbrot set loop by hand :| And yes, `fabs` will mask the sign, and `fchs` reverses it.
GMan
And for the record, the assembly version ran fastest, followed by the mask, followed by the standard fabs call. Keep in mind that when I say "fastest", this is hundredths of milliseconds difference on average, over a couple million iterations. It does add up, though. `fabs` took a bit over a minute longer than the assembly version to go through one million absolute value conversions. Interesting result, though; I'll look around for some compiler flags to try and get VC to generate the same assembly code.
GMan
+4  A: 

So I want to make sure when I profile, I'm not getting skewed results. As such, I'd like to make sure the compiler does not optimize out statements

You are by definition skewing the results.

Here's how to fix the problem of profiling "dummy" code that you wrote just to test: save your results to a global/static array and print one member of the array to the output at the end of the program. The compiler will not be able to optimize out any of the computations that placed values in the array, but you'll still get all the other optimizations it can apply to make the code fast.

280Z28
A: 

A micro-benchmark around this statement will not be representative of using this approach in a genuine scenario; the surrounding instructions and their effect on the pipeline and cache are generally as important as any given statement in itself.

Will
Agreed, things like this rarely matter alone. I'm just seeing if there is still a positive difference between bit hacks and standard calls.
GMan
+3  A: 

Compilers are unfortunately allowed to optimise as much as they like, even without any explicit switches, as long as the code behaves as if no optimisation takes place. However, you can often trick them into not doing so by indicating that the value might be used later, so I would change your code to:

int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}

As others have suggested, you will need to examine the assembler output to check that this approach actually works.

anon
I'm going to accept this one. All of these answers were helpful, but this was the first one to suggest returning the value which I eventually did.
GMan
It is very likely that this function will be inlined into the caller and then removed. You might need to define it in a different .cpp file and make sure that all the linker optimisations are turned off.
Tom Leys
+4  A: 

In this case I suggest you make the function return the integer value:

int float_to_int(float f)
{
   return static_cast<int>(f);
}

Your calling code can then exercise it with a printf, to guarantee the result is used and the call isn't optimized out. Also make sure float_to_int is in a separate compilation unit so the compiler can't play any tricks.

extern int float_to_int(float f);
int sum = 0;
// start timing here
for (int i = 0; i < 1000000; i++)
{
   sum += float_to_int(1.0f);
}
// end timing here
printf("sum=%d\n", sum);

Now compare this to an empty function like:

int take_float_return_int(float /* f */)
{
   return 1;
}

Which should also be external.

The difference in times should give you an idea of the expense of what you're trying to measure.

George Phillips
+2  A: 

You just need to skip to the part where you learn something and read the published Intel CPU optimisation manual.

These quite clearly state that casting between float and int is a really bad idea, because it requires a store from the FPU register to memory followed by a load into an integer register. These operations cause a bubble in the pipeline and waste many precious cycles.

Tom Leys
Harsh but fair :)
AakashM
Did I not say I'm not actually worried about this in production code, and was just curious? Or that there were other functions? This un-answer doesn't answer the question at all.
GMan
No offence intended GMan. Many other answers address your question directly. I am trying to address your thirst for knowledge. I digest such manuals whole - http://people.redhat.com/drepper/cpumemory.pdf - for example shows how memory access is far more important than CPU for most applications.
Tom Leys
A: 

GCC 4 does a lot of micro-optimizations now that GCC 3.4 never did. GCC 4 includes a tree vectorizer that turns out to do a very good job of taking advantage of SSE and MMX. It also uses the GMP and MPFR libraries to assist in optimizing calls to things like sin() and fabs(), as well as optimizing such calls to their FPU, SSE, or 3DNow! equivalents.

I know the Intel compiler is also extremely good at these kinds of optimizations.

My suggestion is to not worry about micro-optimizations like this - on relatively new hardware (anything built in the last 5 or 6 years), they're almost completely moot.

Edit: On recent CPUs, the FPU's fabs instruction is far faster than a cast to int and bit mask, and the fsin instruction is generally going to be faster than precalculating a table or extrapolating a Taylor series. A lot of the optimizations you would find in, for example, "Tricks of the Game Programming Gurus," are completely moot, and as pointed out in another answer, could potentially be slower than instructions on the FPU and in SSE.

All of this is due to the fact that newer CPUs are pipelined: instructions are decoded and dispatched to fast computation units. Instruction cost is no longer a simple clock-cycle count, and performance is more sensitive to cache misses and inter-instruction dependencies.

Check the AMD and Intel processor programming manuals for all the gritty details.

greyfade
Good to know that I should upgrade to 4 :P And I tried to make it clear, but I guess not: I'm not worried at all. I'm usually the one telling others not to worry about tiny things like this. I'm just curious what differences there are now, on modern CPUs.
GMan
I'll edit my answer
greyfade
+2  A: 

Assignment to a volatile variable should never be optimized away, so this might give you the result you want:

static volatile int i = 0;

void float_to_int(float f)
{
    i = static_cast<int>(f); // has no side-effects
}
finnw
A volatile variable may be changed externally, so the compiler has to make sure it uses the value from the original memory location. Apart from that, the compiler is free to generate optimised code to convert a float to int (or for any other operation).
Indeera
+1  A: 

What has always worked on all compilers I've used so far:

extern volatile int writeMe = 0;

void float_to_int(float f)
{    
  writeMe = static_cast<int>(f); 
}

Note that this skews the results; both methods should write to writeMe.

volatile tells the compiler "the value may be accessed without your notice", so the compiler cannot omit the calculation and drop the result. To block propagation of input constants, you might need to run them through an extern volatile, too:

extern volatile float readMe = 0;
extern volatile int writeMe = 0;

void float_to_int(float f)
{    
  writeMe = static_cast<int>(f); 
}

int main()
{
  readMe = 17;
  float_to_int(readMe);
}

Still, all optimizations in between the read and the write can be applied "with full force". The read and write to the global variable are often good "fence posts" when inspecting the generated assembly.

Without the extern, the compiler may notice that a reference to the variable is never taken, and thus determine it can't really be observed externally. Technically, with Link Time Code Generation, even extern might not be enough, but I haven't found a compiler that aggressive. (For a compiler that does remove the access, the reference would need to be passed to a function in a DLL loaded at runtime.)

peterchen
+1  A: 

Return the value?

int float_to_int(float f)
{
    return static_cast<int>(f); // has no side-effects
}

and then at the call site, you can sum all the return values up, and print out the result when the benchmark is done. The usual way to do this is to somehow make sure you depend on the result.

You could use a global variable instead, but it seems like that'd generate more cache misses. Usually, simply returning the value to the caller (and making sure the caller actually does something with it) does the trick.

jalf