views:

163

answers:

7

Ok, I know the title doesn't fully explain this question. I'm writing a program that performs a large number of calculations, and I'm trying to optimize it so that it won't run quite so slowly. I have a member function of a class that gets called around 5 million times. This is the function:

void PointCamera::GetRay(float x, float y, Ray& out)
{
    //Find difference between location on view plane and origin
    float vpPointx = pixelSizex * (x - 0.5f * (float)width);
    float vpPointy = pixelSizey * (((float)height - y) - 0.5f * (float)height);

    //Transform ray to camera's direction
    out.d = u * vpPointx + v * vpPointy - w * lens_distance;
    out.d.Normalize();

    //Set origin to camera location
    out.o = loc;
}

I'm wondering if it is better/faster to declare the variables vpPointx and vpPointy in the class than to declare them each time I call the function. Would this be a good optimization or would it have little effect?

And in general, if there is anything here that could be optimized please let me know.

+9  A: 

By limiting the scope of your variables, you give the compiler's optimiser more opportunity to rearrange your code and make it run faster. For example, it might keep the values of those variables entirely within CPU registers, which may be an order of magnitude faster than memory access. Also, if those variables were class instance variables, the compiler would have to generate code to dereference `this` every time you accessed them, which would very likely be slower than local variable access.

As always, you should measure the performance yourself and try the code both ways (or better, as many ways as you can think of). All optimisation advice is subject to whatever your compiler actually does, which requires experimentation.
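To make the comparison concrete, here is a minimal sketch of the two variants being discussed (the camera class and `Project` function are hypothetical stand-ins, not the asker's actual code). In the first, the temporaries are locals the optimiser can keep in registers; in the second, they are instance fields reached through `this`, and the stores back to the object are real memory writes:

```cpp
#include <cassert>
#include <cmath>

// Variant 1: temporaries are locals; the compiler is free to keep
// vpPointx/vpPointy in registers for their whole lifetime.
struct CameraLocal {
    float pixelSizex = 0.01f, pixelSizey = 0.01f;
    int width = 640, height = 480;
    float Project(float x, float y) const {
        float vpPointx = pixelSizex * (x - 0.5f * (float)width);
        float vpPointy = pixelSizey * (((float)height - y) - 0.5f * (float)height);
        return vpPointx + vpPointy;
    }
};

// Variant 2: temporaries are instance fields; every access goes through
// `this`, and writing them updates the object in memory.
struct CameraMember {
    float pixelSizex = 0.01f, pixelSizey = 0.01f;
    int width = 640, height = 480;
    float vpPointx = 0.0f, vpPointy = 0.0f;  // now part of the object
    float Project(float x, float y) {
        vpPointx = pixelSizex * (x - 0.5f * (float)width);
        vpPointy = pixelSizey * (((float)height - y) - 0.5f * (float)height);
        return vpPointx + vpPointy;
    }
};
```

Both compute the same value; the difference is purely in what the optimiser is allowed to do with the temporaries.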

Greg Hewgill
+3  A: 

Always prefer locals

Anything that is a temporary value should be a local. It's possible that such a value can exist entirely within a register without kicking something else out of cache or requiring a pointless memory store that will use a resource in far shorter supply than CPU cycles.

A dual 3 GHz CPU can execute 6 billion CPU cycles per second. To approach that figure, most operations must involve no memory or cache access, and the result of one cycle must not be needed by the next instruction unless the CPU can find a later instruction that is immediately dispatchable. It all gets quite complicated, but 6 billion somethings, including some wait states, will certainly happen each second.

However, that same CPU system is capable of only 10-40 million memory operations per second. The cache hierarchy partly compensates for the disparity, but caches are still slower than the CPU, they are limited in size, and they do not cope with writes as well as they do with reads.

The good news is that good software abstractions and software speed optimization both agree in this case. Do not store transient state in an object unless you have a reason to reference it later.

DigitalRoss
A: 

This appears to be a raytracer. The little things do add up, but also consider the big wins: you'll get a huge speedup from decent spatial partitioning. Get yourself an octree or kd-tree for a few orders of magnitude of speedup on complex scenes.

As for your direct question: profile it.

Donnie
Yeah it is. I am working on a kd-tree, but I want to optimize as much of it as I can. I just profiled it and it appears that the majority of time is spent in vector operators and constructors. I'm not completely sure how to optimize those because they are just variable initializations and simple math.
Stewart
The idea is that you'll do (a whole lot) less of those operations once you have your KD Tree done, that's why it's such a huge improvement.
Donnie
@Stewart: If a lot of time is spent in constructors, you may want to recycle used objects, so you're not exercising the heap so much. And vector operators, if the dimension is constant, can be unrolled and in-lined.
Mike Dunlavey
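The unrolling suggestion above might look something like this sketch (a hypothetical `Vec3`, not the asker's actual vector class): with a fixed dimension, every operation can be written out with no loops and no heap, so the compiler can inline the whole expression in `GetRay`.

```cpp
#include <cassert>
#include <cmath>

// Fixed-dimension vector with fully unrolled, inlinable operations.
struct Vec3 {
    float x, y, z;
    Vec3 operator*(float s) const { return {x * s, y * s, z * s}; }
    Vec3 operator+(const Vec3& o) const { return {x + o.x, y + o.y, z + o.z}; }
    Vec3 operator-(const Vec3& o) const { return {x - o.x, y - o.y, z - o.z}; }
    void Normalize() {
        // One sqrt, one divide, three multiplies; no loop overhead.
        float inv = 1.0f / std::sqrt(x * x + y * y + z * z);
        x *= inv; y *= inv; z *= inv;
    }
};
```

With everything inlined, an expression like `u * vpPointx + v * vpPointy - w * lens_distance` compiles down to straight-line arithmetic with no constructor calls surviving in the generated code.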
+1  A: 

How about precomputing some of those products that never change? For example, `w * lens_distance` and `0.5f * height`. Compute them once whenever the underlying variables change, then just use the stored values in this function.
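As a sketch of that idea (with hypothetical member and setter names, since the real class isn't shown): cache the product in the setter, so the hot path only reads a precomputed value.

```cpp
#include <cassert>

// Cache products that only change when their inputs change, by
// recomputing them in the setters rather than in the hot function.
class PointCamera {
public:
    void SetHeight(int h) {
        height = h;
        halfHeight = 0.5f * (float)h;  // recomputed only on change
    }
    float HalfHeight() const { return halfHeight; }
    // A setter for lens_distance could likewise recompute w * lens_distance.
private:
    int   height = 0;
    float halfHeight = 0.0f;  // cached 0.5f * height
};
```

The multiplication now happens once per change to `height` instead of once per ray, i.e. a handful of times per frame instead of millions.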

MichaelM
Who says `lens_distance` and `height` *never* change?
Georg Fritzsche
Good suggestion actually. During the course of rendering a frame they don't change.
Stewart
When I said never change, I meant in the function, since they are not parameters to the function, nor computed in the function. If you precompute the results whenever those particular variables do change (in a setter or whatever), you should be able to cut down on some unnecessary multiplications. Of course, if they change before every call to this function there isn't much point.
MichaelM
+1  A: 

There is a performance penalty for declaring them in the class. They are accessed, in effect, as `this->field`, and there will be at minimum one memory write to store each result. The function's local variables could live in registers for their entire lifetime.

Richard Pennington
+1  A: 

I'm not sure, although my guess is that it's better inside the function (declaring the variable is just a push on the stack, whereas making it part of the class means accessing it from memory through indirection on every access). Of course, in reality the compiler probably optimizes all of this into registers anyway.

Which brings me to my point:

You're going about this the wrong way

I don't think that anyone can really tell you what will be faster. It shouldn't matter even if someone does. The only real way to optimize is by measuring.

This usually means one of two things:

  1. One option is to try each way, measure the time each takes, and compare. Note that this isn't always trivial to do (each run can depend on external factors, tricky memory effects, etc.), but running the code a few million times will probably iron that out for you.
  2. Ideally, you should be using a profiler. That's a piece of software designed to measure the code for you, and tell you what parts take the longest amount of time. As most people who have dealt with optimization will tell you, you'll usually be surprised at what takes up a lot of time.

That's why you should always go with the "scientific" method of measuring, instead of relying on anyone's guesswork.
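A minimal timing harness for option 1 might look like this sketch (a real profiler gives far more detail, but this is often enough to compare two variants of a hot function):

```cpp
#include <chrono>

// Run the candidate code a few million times and return the elapsed
// wall-clock time in seconds, so two variants can be compared.
template <typename F>
double TimeLoop(F f, int iterations = 5'000'000) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) f(i);
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

When using it, write each result to a `volatile` sink (or otherwise consume it) so the optimiser can't delete the loop body entirely, which would make both variants measure as zero.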

Edan Maor
But we can't measure everything, and it's perfectly reasonable to ask "which is likely to be faster?".
DigitalRoss
Yes, but as I understood the question, he hasn't tried measuring or profiling, but rather is jumping straight to asking what will be faster. This is not the best thing to do for an actual optimization problem (as opposed to just asking for the sake of learning, which is great).
Edan Maor
Besides, I may just be completely wrong in this, but I don't think that most people can really tell you what is faster. There are so many optimizations at various levels, not to mention the behaviour of the compiler, CPU/cache etc, that I really doubt that in such cases a definitive answer can be given.
Edan Maor
Well I was asking from the standpoint of trying to optimize, but the main reason I asked was because I didn't know theoretically which should be faster. I did profile it. So you're right; that should have given me the answer as to whether or not I should change it.
Stewart
@Edan: You're basically right (if a bit stentorian). Personally, I prefer "capture" to "measure", because "measure" is too non-specific: http://stackoverflow.com/questions/406760/whats-your-most-controversial-programming-opinion/1562802#1562802
Mike Dunlavey
A: 

The others have already covered the benefits of using locals over class variables, so I won't go into that.. but since you asked for optimization tips in general:

  • Your int-to-float casts jump out at me. There's a cost to them, especially if you are using the x87 FPU; using SSE registers makes it better, but the conversion looks thoroughly unnecessary in your function: you could simply store float copies of `width` and `height` in your class.
  • You mentioned in a comment that you were still working on your kd-tree. It's probably a better idea to finish that first before doing the low-level optimization; what appears important now may not take up a fraction of the time later.
  • Use an instruction-level profiler, like VTune. gprof doesn't give you anywhere near enough information.
  • Have you heard of ompf.org? It's a wonderful raytracing forum, and you can learn a lot about the relevant optimizations there.
  • See my answer to this post for more tips.
  • Read Agner Fog.
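The first tip above might be sketched like this (hypothetical member names): keep float mirrors of the integer dimensions, updated whenever they change, so the per-ray path never pays for an int-to-float conversion.

```cpp
#include <cassert>

// Keep float copies of integer dimensions so the hot path never needs
// an int-to-float conversion.
struct PointCamera {
    int   width = 0, height = 0;
    float fwidth = 0.0f, fheight = 0.0f;  // float mirrors, updated on resize

    void Resize(int w, int h) {
        width  = w;          height  = h;
        fwidth = (float)w;   fheight = (float)h;  // convert once, here
    }
};
```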

As an aside: I've heard that the Bounding Interval Hierarchy is much easier to implement. I've not implemented a kd-tree, but I have implemented a BIH, and I'd say it's reasonably straightforward.

int3