views: 980
answers: 14

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?

I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.

Any help is appreciated!

Edit: I'm running on Windows 7 64-bit with Visual Studio 2008 on a Core 2 Duo T5850 (2.16 GHz)

typedef double real;

class Vector
{
private:
    real vec[3];

public:
    // Operators defined here
};

struct Particle
{
    Vector pos, vel, acc, jerk;
    Vector oldPos, oldVel, oldAcc, oldJerk;
    real mass;
};

real Gravity::interact(Particle *p, size_t numParticles)
{
    PROFILE_FUNC();
    real tau_q = 1e300;

    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].jerk = 0;
        p[i].acc = 0;
    }

    for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            Vector r = p[j].pos - p[i].pos;
            Vector v = p[j].vel - p[i].vel;
            real r2 = lengthsq(r);
            real v2 = lengthsq(v);

            // Calculate inverse of |r|^3
            real r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = v / |r|^3 - 3 * (r . v) * r / |r|^5
            Vector da = r * r3i;
            Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            real mij = p[i].mass + p[j].mass;
            real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            real tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }

    return sqrt(sqrt(tau_q));
}
A: 

Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.

That said, always apply algorithmic optimizations first; only when no faster algorithm is available should you resort to piecemeal optimization in assembly. And then, do inner loops first.

You may want to profile to see if this is really the bottleneck first.

Joshua
Err.. doing it wrong? You assume that the average assembly programmer is better than a compiler -- which is usually not true.
Billy ONeal
I have profiled my code and can verify this code does account for 99% of the execution time. I was just wondering if there are any top level optimizations left before I start using assembly.
Sagekilla
If you have profiled, is there a line that takes much more than the others?
Vicente Botet Escriba
@Billy I don't know about the average developer but I can almost always take the compiler's output and squeeze a few more % out of it.
Joshua
@Joshua: Really? http://blogs.msdn.com/oldnewthing/archive/2004/12/16/317157.aspx <-- Prime example of why you should leave this to compilers.
Billy ONeal
Never optimize w/o profiling before and after. With something like this one, I'm likely to optimize registers across function calls, something the compiler cannot do.
Joshua
+19  A: 
  1. Inline the calls to lengthsq().

  2. Change pow(r2, -1.5) to 1/(r2*sqrt(r2)) to lower the cost of computing r2^-1.5.

  3. Use scalars (p_i_acc, etc.) inside the innermost loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily. (A combined sketch of points 2-4a appears at the end of this answer.)

4a. Try replacing the if (...) tau_q = with

    tau_q=minimum(...,...)

Many compilers recognize the minimum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes.

4b. [EDIT to split 4a and 4b apart] You might consider storing tau_est_q2 instead as tau_q, and comparing against r2/v2 rather than (r2*r2)/(v2*v2). Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau_est_q2 at the end. To do this, collect the minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop.

  5. [EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and a large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.

Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand-coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it]), and c) going multicore (using OpenMP).
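
To make points 2, 3 and 4a concrete, here is a rough sketch (not part of the original answer) that reuses the question's Particle/Vector types and its lengthsq()/dot() helpers; the function name interactTuned is made up, and PROFILE_FUNC() is omitted:

#include <cmath>
#include <algorithm>

real interactTuned(Particle *p, size_t numParticles)
{
    real tau_q = 1e300;

    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].acc = 0;
        p[i].jerk = 0;
    }

    for (size_t i = 0; i < numParticles; i++)
    {
        // Point 3: local copies so the compiler never worries about p[i] aliasing p[j].
        Vector acc_i  = p[i].acc;
        Vector jerk_i = p[i].jerk;
        const Vector pos_i  = p[i].pos;
        const Vector vel_i  = p[i].vel;
        const real   mass_i = p[i].mass;

        for (size_t j = i + 1; j < numParticles; j++)
        {
            Vector r = p[j].pos - pos_i;
            Vector v = p[j].vel - vel_i;
            real r2 = lengthsq(r);
            real v2 = lengthsq(v);

            // Point 2: r2^-1.5 as 1 / (r2 * sqrt(r2)) instead of pow().
            real r3i = Constants::G / (r2 * std::sqrt(r2));

            Vector da = r * r3i;
            Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            acc_i     += da * p[j].mass;
            jerk_i    += dj * p[j].mass;
            p[j].acc  -= da * mass_i;
            p[j].jerk -= dj * mass_i;

            // Point 4a: min() instead of branches.
            real mij = mass_i + p[j].mass;
            tau_q = std::min(tau_q, r2 / (lengthsq(da) * mij * mij));
            tau_q = std::min(tau_q, (r2 * r2) / (v2 * v2));
        }

        // Point 3 again: write the accumulators back once per i.
        p[i].acc  = acc_i;
        p[i].jerk = jerk_i;
    }

    return std::sqrt(std::sqrt(tau_q));
}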

Ira Baxter
Or better, `r2*r2*inverse_sqrt(r2)` (choose your platform's appropriate function) avoids division entirely. Or write `r2*r2/sqrt(r2)`, use your compiler's loose math setting, and check the assembly output or just benchmark.
Potatoswatter
I should have mentioned earlier the calls to lengthsq() and other Vector operations are all defined as inline. I will try your second and third suggestion though. Could I get around #3 by telling my compiler that the pointers are not aliased, or would it provide a larger speedup?
Sagekilla
@Potatoswatter: good idea, but that's `pow(r2, +1.5)` - it needs to be `pow(r2, -1.5)`
Paul R
@Sagekilla: you can try telling your compiler that the pointers aren't aliased. The amount of recoding to use the scalars is pretty minimal, and it makes you independent of whatever crazy interpretation the compiler gives to that switch. When in doubt, play it safe.
Ira Baxter
@Paul, d'oh! Well, `inverse_sqrt(r2)/r2` then.
Potatoswatter
You needn't use assembly to go SIMD; most compilers offer intrinsic functions, which look like C function calls but actually translate to a native opcode.
Crashworks
+1 for aliasing. It might be (minorly) helpful to save Constants::G as a local variable too.
celion
@Celion: Constants::G presumably *is* a constant. How will putting a constant into a local variable do anything but slow the code down?
Ira Baxter
@Ira: What exactly is loop folding? I did some google searching but I couldn't find what it is you're talking about here.
Sagekilla
@Sagekilla: it means merging the bodies of the loops. You can trivially move the initialization out of your first loop, into the head of the second one.
Ira Baxter
@Ira: I don't think I can actually do that. Inside the second pair of for loops I accumulate the acceleration and jerk on p[i] and p[j]. If I pushed the first loop into the body of the second loop I would get incorrect acceleration and jerk, due to how I calculate them.
Sagekilla
@Sagekilla: I think I missed the fact that you are updating p[j].jerk in the loop by misreading it as p[i].jerk. In that case, you're right, you can't do that. You should still use the scalar trick in the loop.
Ira Baxter
+3  A: 
  • Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.

  • You might consider whether you can use floats rather than doubles.

  • If you're using gcc then make sure you're using -O2 or possibly -O3.

  • You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).

  • Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).

Paul R
I've looked into the possibility of using floats, which I've decided to account for by providing the real typedef. In this particular case, I need high speed and accuracy that utilizes full machine precision (to 1E-15).
Sagekilla
As a quick test - try `long double`, just to compare performance with `double`. You never know, it could conceivably be faster.
Steve Jessop
+6  A: 

This line real r3i = Constants::G * pow(r2, -1.5); is going to hurt. Any kind of sqrt lookup or platform specific help with a square root would help.

If you have SIMD capabilities, breaking up your vector subtracts and squares into their own loop and computing them all at once will help a bit. The same goes for your mass/jerk calcs.

Something that comes to mind: are you keeping enough precision in your calc? Taking things to the 4th power and 4th root really thrashes your available bits through the under/overflow blender. I'd be sure that your answer is indeed your answer when complete.

Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.

Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius-squared check to fight the O(n^2) behavior of your loop. If you eliminated half your particles, it would run around 4x faster.

One last thing. You could thread your inner loop across multiple processors. You'd have to make a separate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally your mass/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)
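
Not part of the original answer, but as a rough illustration of the per-thread tally idea above: an OpenMP sketch that gives each thread its own accumulation buffer and merges them at the end. It reuses the question's types, shows only acc (jerk would follow the same pattern), and assumes Vector is default-constructible and accepts = 0 as in the OP's code:

#include <cmath>
#include <vector>
#include <omp.h>

void accumulateAcc(Particle *p, size_t n)
{
    const int nThreads = omp_get_max_threads();

    // One private buffer per thread: no two threads ever write the same slot,
    // so there is no contention and no locking.
    std::vector<Vector> accBuf(nThreads * n);
    for (size_t k = 0; k < accBuf.size(); k++)
        accBuf[k] = 0;

    #pragma omp parallel
    {
        Vector *acc = &accBuf[omp_get_thread_num() * n];

        // OpenMP 2.0 (what VS2008 ships) wants a signed loop counter.
        #pragma omp for schedule(dynamic, 16)
        for (long i = 0; i < (long)n; i++)
        {
            for (size_t j = (size_t)i + 1; j < n; j++)
            {
                Vector r   = p[j].pos - p[i].pos;
                real   r2  = lengthsq(r);
                real   r3i = Constants::G / (r2 * std::sqrt(r2));
                Vector da  = r * r3i;

                acc[i] += da * p[j].mass;   // thread-private writes only
                acc[j] -= da * p[i].mass;
            }
        }
    }

    // Tally each thread's buffer back into the particles.
    for (size_t i = 0; i < n; i++)
    {
        p[i].acc = 0;
        for (int t = 0; t < nThreads; t++)
            p[i].acc += accBuf[t * n + i];
    }
}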

Michael Dorgan
Yes, my calculations are precise. The tau_q calculation is actually necessary the way it is, due to dimensional analysis. The tau_est_q1 and tau_est_q2 calculations are designed that way so I get the proper units of seconds to the fourth power, which would be converted to seconds at the end.
Sagekilla
Is culling your particles on distance possible?
Michael Dorgan
Culling unfortunately is not possible, as this is a pure gravity simulation. All pairwise interactions have to be accounted for. I will at some point try to attain O(n logn) by using the Barnes-Hut algorithm, but my main issue will remain: Can my calculations inside the loop be optimized still?
Sagekilla
+1  A: 

Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?

What's real? Can it be a float?

Apart from that you'll have to turn to MMX/SSE/assembly optimisations.

AshleysBrain
`real` is synonymous for `double` in this code. See the `typedef` at the top.
kurige
A: 

The thing I look for is branching; branches tend to be performance killers.

You can use loop unrolling.

Also, remember that working on smaller parts of the problem:

  for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {

is about the same as having one loop doing everything, and you can get speed-ups through loop unrolling and better cache hits.

You could thread this to make better use of multiple cores

You have some expensive calculations that you might be able to reduce, especially if they end up computing the same thing; caching intermediate results can help.

But you really need to know where it's costing you the most.

Keith Nicholas
+2  A: 

Would you benefit from the famous "fast inverse square root" algorithm?

float InvSqrt(float x)
{
    union {
        float f;
        int i;
    } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);    // bit-level trick: rough initial guess
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement step
}

It returns a reasonably accurate approximation of 1/sqrt(x) (the first iteration of Newton's method with a clever initial guess). It is widely used in computer graphics and game development.
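
For illustration only (not from the original answer), it would slot into the question's inner loop roughly like this, at single precision:

// Approximate G * r2^-1.5 via the answer's InvSqrt (float precision only).
float ir  = InvSqrt((float)r2);                   // ~ r2^-0.5
real  r3i = Constants::G * (real)(ir * ir * ir);  // ~ G * r2^-1.5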

Arrieta
This is not an optimization on almost any current architecture. If you need a fast approximate inverse square root, use a dedicated hardware instruction to get it. (`rsqrtss` on x86 with SSE, `vrsqrte.f32` on ARM with NEON, `vrsqrtefp` on PowerPC with AltiVec). They're all both faster and more accurate.
Stephen Canon
I can verify what Stephen claims. I did some timings of the invsqrt versus sqrt-then-divide, and the invsqrt was only marginally faster and far less accurate. For a similar level of accuracy (by adding more iterations of Newton's method -- that last return step) invsqrt was slower.
Sagekilla
+3  A: 

If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.

Emile Cormier
+3  A: 

Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.
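
For example (hypothetical lookahead of 8 particles; the distance needs tuning and the benefit has to be benchmarked):

// Needs <xmmintrin.h>. Inside the j loop: hint a later particle into cache early.
if (j + 8 < numParticles)
    _mm_prefetch((const char *)&p[j + 8], _MM_HINT_T0);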


If you move the mass out too, you could store things like this:

struct ParticleData
{
    Vector pos, vel, acc, jerk;
};

ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...

then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
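
A tiny sketch of that update step (not part of the original answer), assuming ParticleData stays a plain aggregate with no virtuals so memcpy is safe:

#include <cstring>

// Snapshot the whole current state into the "old" block in one call.
void saveOldState(ParticleData *oldParticles,
                  const ParticleData *currentParticles,
                  size_t numParticles)
{
    std::memcpy(oldParticles, currentParticles,
                numParticles * sizeof(ParticleData));
}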


If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g

struct ParticleData
{
    // data_x[0] == pos.x, data_x[1] = vel.x, data_x[2] = acc.x, data_x[3] = jerk.x
    Vector4 data_x;

    // data_y[0] == pos.y, data_y[1] = vel.y, etc.
    Vector4 data_y;

    // data_z[0] == pos.z, data_z[1] = vel.z, etc.
    Vector4 data_z;
};

where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)

Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).

celion
Thank you! That's an interesting idea, I'll have to give this a try. Although my code doesn't explicitly state it, the oldData is used in my integration functions where I need to keep track of older data. I will try splitting my data and doing that memcpy.
Sagekilla
The memcpy is a small benefit and only O(N). Reducing cache misses in the inner loop is where you'll see the main gain (I hope).
celion
A: 

You should re-use the reals and Vectors rather than constructing them anew on every iteration. The cost of constructing a Vector or real might be trivial... but not if numParticles is very large, especially with your roughly (n^2)/2-iteration loop.

Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;
for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            r = p[j].pos - p[i].pos;
            v = p[j].vel - p[i].vel;
            r2 = lengthsq(r);
            v2 = lengthsq(v);

            // Calculate inverse of |r|^3
            r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
            da = r * r3i;
            dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            mij = p[i].mass + p[j].mass;
            tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }
DeadMG
As there is no constructor and no members, that vector is a POD and as such, there shouldn't be any initialization code right? One would hope that the compiler would be smart enough to 'register' those values. But you never know...
Michael Dorgan
yes, I'd hope the compiler would perform this. But, better safe than sorry, in my experience. It's generally good practice to re-use reusable variables anyway - if he wrote another loop that involved, say, strings, it's a good habit to get into.
DeadMG
It's 2010. There's absolutely no need to worry about reusing variables. In fact, if you're using a modern compiler, and reuse a variable, it may even internally represent the reuse as a new variable. There's also no need to do something that makes code less readable and provides no performance benefit.
+2  A: 

My first advice would be to look at the molecular dynamics literature; people in this field have considered a lot of optimizations for particle systems. Have a look at GROMACS for example.

With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is the use of neighbour lists with buffer regions to update those lists only when needed.

Adrien
A: 

Consider also pulling your multiplication by Constants::G out of the loop. If you can change the semantic meaning of the vectors stored so that they effectively store the actual value/G, you can do the gravitational constant multiplication as needed.

Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.

Consider splitting your particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could be an efficient memset if you did this. You would then essentially have two arrays (or vectors) where particle 'n' is stored at index 'n' of each of the arrays.
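
A possible shape for that split (hypothetical struct names; valid only if Vector stays plain-old-data, so memset-to-zero yields 0.0 for IEEE 754 doubles):

#include <cstring>
#include <vector>

struct ParticleState { Vector pos, vel; real mass; };  // read-only in the force loop
struct ParticleDeriv { Vector acc, jerk; };            // reset every step

// The whole "zero acc and jerk" loop collapses into one call.
void resetDerivatives(std::vector<ParticleDeriv> &deriv)
{
    std::memset(&deriv[0], 0, deriv.size() * sizeof(ParticleDeriv));
}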

StarLight
A: 

You can replace any occurrence of:

a = b/c
d = e/f

with

icf = 1/(c*f)
a = b*f*icf
d = e*c*icf

if you know that icf isn't going to cause anything to go out of range and if your hardware can perform 3 multiplications faster than a division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
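
Not from the original answer: applied to the question's inner loop, once pow(r2, -1.5) has been rewritten as 1/(r2*sqrt(r2)) as suggested in another answer, the two divides per pair can share a single reciprocal:

// Inside the inner loop: one division per pair instead of two.
real rs   = r2 * sqrt(r2);                              // |r|^3
real icf  = 1 / (rs * r2);                              // single shared reciprocal
real r3i  = Constants::G * r2 * icf;                    // == G / |r|^3
Vector da = r * r3i;
Vector dj = (v - r * (3 * dot(r, v) * rs * icf)) * r3i; // rs*icf == 1/r2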

You'll get away with fewer time steps if you use other integration schemes (eg. Runge-Kutta) but I suspect you already know that.

+2  A: 

All good stuff above. I've been doing similar things with a 2nd order (Leapfrog) integrator. The next two things I did after considering many of the improvements suggested above were to start using SSE intrinsics to take advantage of vectorization, and to parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.

SSE example:

http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp

Novel cache algorithm, explanation and example code:

http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/

http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp

You might also find the following deck I gave at Seattle Code Camp interesting:

http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/

Your fourth order integrator is more complex and would be harder to parallelize with limited gains on a two core system, but I would definitely suggest checking out SSE; I got some reasonable performance improvements here.

Ade Miller
Thanks a lot! Those links were extremely useful. I already have another version that's parallelized where I looped n^2 (instead of (n^2 - n) / 2) times so that I could accumulate the ith particle using force of the jth particle. That version had no race conditions and no need for locking because of how I wrote it.
Sagekilla
I just took a look at your NBody code (Are you also reading Art of Computational Science?) and I have one small recommendation: Try writing your integrators as a predict and correct step, and pull out your force calculation. Then you can write it as a PEC scheme (Predict, Evaluate, Correct).
Sagekilla
That's how mine are written too. The novel algorithm gets you back to (n^2 - n) / 2 for parallel code and also lets you fit blocks of the problem into L1 cache, reducing cache misses. This made a significant difference for me - nearly 2x. At some point cache misses become by far your biggest overhead.
Ade Miller
Yes, I've read chunks of the Art of Comp Sci. I should really get back to it. Currently everything uses a global timestep. I should play around with that some more and move to individual timesteps. I sort of got sidetracked by the whole CUDA thing, which is fast but hard to code for. I have a Hermite integrator in C#, maybe I'll play around with that. Thanks!
Ade Miller