views: 980
answers: 14

Below is my innermost loop that's run several thousand times, with input sizes of 20 - 1000 or more. This piece of code takes up 99 - 99.5% of execution time. Is there anything I can do to help squeeze any more performance out of this?

I'm not looking to move this code to something like using tree codes (Barnes-Hut), but towards optimizing the actual calculations happening inside, since the same calculations occur in the Barnes-Hut algorithm.

Any help is appreciated!

Edit: I'm running on Windows 7 64-bit with Visual Studio 2008 on a Core 2 Duo T5850 (2.16 GHz)

typedef double real;

class Vector
{
private:
    real vec[3];

public:
    // Operators defined here
};

struct Particle
{
    Vector pos, vel, acc, jerk;
    Vector oldPos, oldVel, oldAcc, oldJerk;
    real mass;
};

real Gravity::interact(Particle *p, size_t numParticles)
{
    PROFILE_FUNC();
    real tau_q = 1e300;

    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].jerk = 0;
        p[i].acc = 0;
    }

    for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            Vector r = p[j].pos - p[i].pos;
            Vector v = p[j].vel - p[i].vel;
            real r2 = lengthsq(r);
            real v2 = lengthsq(v);

            // Calculate inverse of |r|^3
            real r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = v / |r|^3 - 3 * (r . v) * r / |r|^5
            Vector da = r * r3i;
            Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            real mij = p[i].mass + p[j].mass;
            real tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            real tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }

    return sqrt(sqrt(tau_q));
}
A: 

Yes. Try looking at the assembly output. It may yield clues as to where the compiler is doing it wrong.

That said, always apply algorithmic optimizations first; only when no faster algorithm is available should you resort to piecemeal optimization in assembly. And then, do inner loops first.

You may want to profile to see if this is really the bottleneck first.

Joshua
Err.. doing it wrong? You assume that the average assembly programmer is better than a compiler -- which is usually not true.
Billy ONeal
I have profiled my code and can verify this code does account for 99% of the execution time. I was just wondering if there are any top level optimizations left before I start using assembly.
Sagekilla
If you have profiled, is there a line that takes much more than the others?
Vicente Botet Escriba
@Billy I don't know about the average developer but I can almost always take the compiler's output and squeeze a few more % out of it.
Joshua
@Joshua: Really? http://blogs.msdn.com/oldnewthing/archive/2004/12/16/317157.aspx <-- Prime example of why you should leave this to compilers.
Billy ONeal
Never optimize w/o profiling before and after. With something like this one, I'm likely to optimize registers across function calls, something the compiler cannot do.
Joshua
+19  A: 
  1. Inline the calls to lengthsq().

  2. Change pow(r2, -1.5) to 1/(r2*sqrt(r2)) to lower the cost of computing r2^-1.5.

  3. Use scalars (p_i_acc, etc.) inside the innermost loop rather than p[i].acc to collect your result. The compiler may not know that p[i] isn't aliased with p[j], and that might force addressing of p[i] on each loop iteration unnecessarily. (A combined sketch of points 2-4a appears at the end of this answer.)

4a. Try replacing the if (...) tau_q = with

    tau_q=minimum(...,...)

Many compilers recognize the minimum function as one they can do with predicated operations rather than real branches, avoiding pipeline flushes.

4b. [EDIT to split 4a and 4b apart] You might consider storing tau_est_q2 instead as tau_q, and comparing against r2/v2 rather than (r2*r2)/(v2*v2). Then you avoid doing two multiplies for each iteration in the inner loop, in trade for a single squaring operation to compute tau_est_q2 at the end. To do this, collect the minimums of tau_q1 and tau_q2 (not squared) separately, and take the minimum of those results in a single scalar operation on completion of the loop.

  5. [EDIT: I suggested the following, but in fact it isn't valid for the OP's code, because of the way he updates in the loop.] Fold the two loops together. With the two loops and a large enough set of particles, you thrash the cache and force a refetch from non-cache of those initial values in the second loop. The fold is trivial to do.

Beyond this you need to consider a) loop unrolling, b) vectorizing (using SIMD instructions; either hand-coding assembler or using the Intel compiler, which is supposed to be pretty good at this [but I have no experience with it]), and c) going multicore (using OpenMP).
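
To make points 2, 3 and 4a concrete, here is a rough sketch (not part of the original answer) that reuses the question's Particle/Vector types and its lengthsq()/dot() helpers; the function name interactTuned is made up, and PROFILE_FUNC() is omitted:

#include <cmath>
#include <algorithm>

real interactTuned(Particle *p, size_t numParticles)
{
    real tau_q = 1e300;

    for (size_t i = 0; i < numParticles; i++)
    {
        p[i].acc = 0;
        p[i].jerk = 0;
    }

    for (size_t i = 0; i < numParticles; i++)
    {
        // Point 3: local copies so the compiler never worries about p[i] aliasing p[j].
        Vector acc_i  = p[i].acc;
        Vector jerk_i = p[i].jerk;
        const Vector pos_i  = p[i].pos;
        const Vector vel_i  = p[i].vel;
        const real   mass_i = p[i].mass;

        for (size_t j = i + 1; j < numParticles; j++)
        {
            Vector r = p[j].pos - pos_i;
            Vector v = p[j].vel - vel_i;
            real r2 = lengthsq(r);
            real v2 = lengthsq(v);

            // Point 2: r2^-1.5 as 1 / (r2 * sqrt(r2)) instead of pow().
            real r3i = Constants::G / (r2 * std::sqrt(r2));

            Vector da = r * r3i;
            Vector dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            acc_i     += da * p[j].mass;
            jerk_i    += dj * p[j].mass;
            p[j].acc  -= da * mass_i;
            p[j].jerk -= dj * mass_i;

            // Point 4a: min() instead of branches.
            real mij = mass_i + p[j].mass;
            tau_q = std::min(tau_q, r2 / (lengthsq(da) * mij * mij));
            tau_q = std::min(tau_q, (r2 * r2) / (v2 * v2));
        }

        // Point 3 again: write the accumulators back once per i.
        p[i].acc  = acc_i;
        p[i].jerk = jerk_i;
    }

    return std::sqrt(std::sqrt(tau_q));
}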

Ira Baxter
Or better, `r2*r2*inverse_sqrt(r2)` (choose your platform's appropriate function) avoids division entirely. Or write `r2*r2/sqrt(r2)`, use your compiler's loose math setting, and check the assembly output or just benchmark.
Potatoswatter
I should have mentioned earlier the calls to lengthsq() and other Vector operations are all defined as inline. I will try your second and third suggestion though. Could I get around #3 by telling my compiler that the pointers are not aliased, or would it provide a larger speedup?
Sagekilla
@Potatoswatter: good idea, but that's `pow(r2, +1.5)` - it needs to be `pow(r2, -1.5)`
Paul R
@Sagekilla: you can try telling your compiler that the pointers aren't aliased. The amount of recoding to use the scalars is pretty minimal, and it makes you independent of whatever crazy interpretation the compiler gives to that switch. When in doubt, play it safe.
Ira Baxter
@Paul, d'oh! Well, `inverse_sqrt(r2)/r2` then.
Potatoswatter
You needn't use assembly to go SIMD; most compilers offer intrinsic functions, which look like C function calls but actually translate to a native opcode.
Crashworks
+1 for aliasing. It might be (minorly) helpful to save Constants::G as a local variable too.
celion
@Celion: Constants::G presumably *is* a constant. How will putting a constant into a local variable do anything but slow the code down?
Ira Baxter
@Ira: What exactly is loop folding? I did some google searching but I couldn't find what it is you're talking about here.
Sagekilla
@Sagekilla: it means merging the bodies of the loops. You can trivially move the initialization out of your first loop, into the head of the second one.
Ira Baxter
@Ira: I don't think I can actually do that. Inside the second pair of for loops I accumulate the acceleration and jerk on p[i] and p[j]. If I pushed the first loop into the body of the second loop I would get incorrect acceleration and jerk, due to how I calculate them.
Sagekilla
@Sagekilla: I think I missed the fact that you are updating p[j].jerk in the loop by misreading it as p[i].jerk. In that case, you're right, you can't do that. You should still use the scalar trick in the loop.
Ira Baxter
+3  A: 
  • Firstly you need to profile the code. The method for this will depend on what CPU and OS you are running.

  • You might consider whether you can use floats rather than doubles.

  • If you're using gcc then make sure you're using -O2 or possibly -O3.

  • You might also want to try a good compiler, like Intel's ICC (assuming this is running on x86 ?).

  • Again assuming this is (Intel) x86, if you have a 64-bit CPU then build a 64-bit executable if you're not already - the extra registers can make a noticeable difference (around 30%).

Paul R
I've looked into the possibility of using floats, which I've decided to account for by providing the real typedef. In this particular case, I need high speed and accuracy that utilizes full machine precision (to 1E-15).
Sagekilla
As a quick test - try `long double`, just to compare performance with `double`. You never know, it could conceivably be faster.
Steve Jessop
+6  A: 

This line real r3i = Constants::G * pow(r2, -1.5); is going to hurt. Any kind of sqrt lookup or platform specific help with a square root would help.

If you have SIMD capabilities, breaking up your vector subtracts and squares into their own loop and computing them all at once will help a bit. The same goes for your mass/jerk calcs.

Something that comes to mind: are you keeping enough precision in your calc? Taking things to the 4th power and 4th root really thrashes your available bits through the under/overflow blender. I'd be sure that your answer is indeed your answer when complete.

Beyond that, it's a math heavy function that will require some CPU time. Assembler optimization of this isn't going to yield too much more than the compiler can already do for you.

Another thought. As this appears to be gravity related, is there any way to cull your heavy math based on a distance check? Basically, a radius/radius-squared check to fight the O(n^2) behavior of your loop. If you eliminated half your particles, it would run around 4x faster.

One last thing. You could thread your inner loop across multiple processors. You'd have to make a separate version of your internals per thread to prevent data contention and locking overhead, but once each thread was complete, you could tally your mass/jerk values from each structure. I didn't see any dependencies that would prevent this, but I am no expert in this area by far :)
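
Not part of the original answer, but as a rough illustration of the per-thread tally idea above: an OpenMP sketch that gives each thread its own accumulation buffer and merges them at the end. It reuses the question's types, shows only acc (jerk would follow the same pattern), and assumes Vector is default-constructible and accepts = 0 as in the OP's code:

#include <cmath>
#include <vector>
#include <omp.h>

void accumulateAcc(Particle *p, size_t n)
{
    const int nThreads = omp_get_max_threads();

    // One private buffer per thread: no two threads ever write the same slot,
    // so there is no contention and no locking.
    std::vector<Vector> accBuf(nThreads * n);
    for (size_t k = 0; k < accBuf.size(); k++)
        accBuf[k] = 0;

    #pragma omp parallel
    {
        Vector *acc = &accBuf[omp_get_thread_num() * n];

        // OpenMP 2.0 (what VS2008 ships) wants a signed loop counter.
        #pragma omp for schedule(dynamic, 16)
        for (long i = 0; i < (long)n; i++)
        {
            for (size_t j = (size_t)i + 1; j < n; j++)
            {
                Vector r   = p[j].pos - p[i].pos;
                real   r2  = lengthsq(r);
                real   r3i = Constants::G / (r2 * std::sqrt(r2));
                Vector da  = r * r3i;

                acc[i] += da * p[j].mass;   // thread-private writes only
                acc[j] -= da * p[i].mass;
            }
        }
    }

    // Tally each thread's buffer back into the particles.
    for (size_t i = 0; i < n; i++)
    {
        p[i].acc = 0;
        for (int t = 0; t < nThreads; t++)
            p[i].acc += accBuf[t * n + i];
    }
}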

Michael Dorgan
Yes, my calculations are precise. The tau_q calculation is actually necessary the way it is, due to dimensional analysis. The tau_est_q1 and tau_est_q2 calculations are designed that way so I get the proper units of seconds to the fourth power, which would be converted to seconds at the end.
Sagekilla
Is culling your particles on distance possible?
Michael Dorgan
Culling unfortunately is not possible, as this is a pure gravity simulation. All pairwise interactions have to be accounted for. I will at some point try to attain O(n logn) by using the Barnes-Hut algorithm, but my main issue will remain: Can my calculations inside the loop be optimized still?
Sagekilla
+1  A: 

Apart from straightforward add/subtract/divide/multiply, pow() is the only heavyweight function I see in the loop body. It's probably pretty slow. Can you precompute it or get rid of it, or replace it with something simpler?

What's real? Can it be a float?

Apart from that you'll have to turn to MMX/SSE/assembly optimisations.

AshleysBrain
`real` is synonymous for `double` in this code. See the `typedef` at the top.
kurige
A: 

The thing I look for is branching; branches tend to be performance killers.

You can use loop unrolling.

Also, remember that working on smaller parts of the problem:

  for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {

is about the same as having one loop doing everything, and you can get speed-ups through loop unrolling and better cache hits.

You could thread this to make better use of multiple cores

You have some expensive calculations that you might be able to reduce, especially if they end up computing the same thing; caching intermediate results can help.

But you really need to know where it's costing you the most.

Keith Nicholas
+2  A: 

Would you benefit from the famous "fast inverse square root" algorithm?

float InvSqrt(float x)
{
    union {
        float f;
        int i;
    } tmp;
    tmp.f = x;
    tmp.i = 0x5f3759df - (tmp.i >> 1);    // bit-level trick: rough initial guess
    float y = tmp.f;
    return y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement step
}

It returns a reasonably accurate approximation of 1/sqrt(x) (the first iteration of Newton's method with a clever initial guess). It is widely used in computer graphics and game development.
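
For illustration only (not from the original answer), it would slot into the question's inner loop roughly like this, at single precision:

// Approximate G * r2^-1.5 via the answer's InvSqrt (float precision only).
float ir  = InvSqrt((float)r2);                   // ~ r2^-0.5
real  r3i = Constants::G * (real)(ir * ir * ir);  // ~ G * r2^-1.5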

Arrieta
This is not an optimization on almost any current architecture. If you need a fast approximate inverse square root, use a dedicated hardware instruction to get it. (`rsqrtss` on x86 with SSE, `vrsqrte.f32` on ARM with NEON, `vrsqrtefp` on PowerPC with AltiVec). They're all both faster and more accurate.
Stephen Canon
I can verify what Stephen claims. I did some timings of the invsqrt versus sqrt-then-divide, and the invsqrt was only marginally faster and far less accurate. For a similar level of accuracy (by adding more iterations of Newton's method -- that last return step) invsqrt was slower.
Sagekilla
+3  A: 

If this is for visual effects, and your particle position/speed only need to be approximate, then you can try replacing sqrt with the first few terms of its respective Taylor series. The magnitude of the next unused term represents the error margin of your approximation.

Emile Cormier
+3  A: 

Easy thing first: move all the "old" variables to a different array. You never access them in your main loop, so you're touching twice as much memory as you actually need (and thus getting twice as many cache misses). Here's a recent blog post on the subject: http://msinilo.pl/blog/?p=614. And of course, you could prefetch a few particles ahead, e.g. p[j+k], where k is some constant that will take some experimentation.
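
For example (hypothetical lookahead of 8 particles; the distance needs tuning and the benefit has to be benchmarked):

// Needs <xmmintrin.h>. Inside the j loop: hint a later particle into cache early.
if (j + 8 < numParticles)
    _mm_prefetch((const char *)&p[j + 8], _MM_HINT_T0);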


If you move the mass out too, you could store things like this:

struct ParticleData
{
    Vector pos, vel, acc, jerk;
};

ParticleData* currentParticles = ...
ParticleData* oldParticles = ...
real* masses = ...

then updating the old particle data from the new data becomes a single big memcpy from the current particles to the old particles.
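
A tiny sketch of that update step (not part of the original answer), assuming ParticleData stays a plain aggregate with no virtuals so memcpy is safe:

#include <cstring>

// Snapshot the whole current state into the "old" block in one call.
void saveOldState(ParticleData *oldParticles,
                  const ParticleData *currentParticles,
                  size_t numParticles)
{
    std::memcpy(oldParticles, currentParticles,
                numParticles * sizeof(ParticleData));
}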


If you're willing to make the code a bit uglier, you might be able to get better SIMD optimization by storing things in "transposed" format, e.g

struct ParticleData
{
    // data_x[0] == pos.x, data_x[1] = vel.x, data_x[2] = acc.x, data_x[3] = jerk.x
    Vector4 data_x;

    // data_y[0] == pos.y, data_y[1] = vel.y, etc.
    Vector4 data_y;

    // data_z[0] == pos.z, data_z[1] = vel.z, etc.
    Vector4 data_z;
};

where Vector4 is either one single-precision or two double-precision SIMD vectors. This format is common in ray tracing for testing multiple rays at once; it lets you do operations like dot products more efficiently (without shuffles), and it also means your memory loads can be 16-byte aligned. It definitely takes a few minutes to wrap your head around though :)

Hope that helps, let me know if you need a reference on using the transposed representation (although I'm not sure how much help it would actually be here either).

celion
Thank you! That's an interesting idea, I'll have to give this a try. Although my code doesn't explicitly state it, the oldData is used in my integration functions where I need to keep track of older data. I will try splitting my data and doing that memcpy.
Sagekilla
The memcpy is a small benefit and only O(N). Reducing cache misses in the inner loop is where you'll see the main gain (I hope).
celion
A: 

You should re-use the reals and Vectors rather than constructing them anew on every iteration. The cost of constructing a Vector or real might be trivial... but not if numParticles is very large, especially with your roughly (n^2)/2-iteration loop.

Vector r;
Vector v;
real r2;
real v2;
Vector da;
Vector dj;
real r3i;
real mij;
real tau_est_q1;
real tau_est_q2;
for (size_t i = 0; i < numParticles; i++)
    {
        for (size_t j = i+1; j < numParticles; j++)
        {
            r = p[j].pos - p[i].pos;
            v = p[j].vel - p[i].vel;
            r2 = lengthsq(r);
            v2 = lengthsq(v);

            // Calculate inverse of |r|^3
            r3i = Constants::G * pow(r2, -1.5);

            // da = r / |r|^3
            // dj = (v / |r|^3 - 3 * (r . v) * r / |r|^5
            da = r * r3i;
            dj = (v - r * (3 * dot(r, v) / r2)) * r3i;

            // Calculate new acceleration and jerk
            p[i].acc += da * p[j].mass;
            p[i].jerk += dj * p[j].mass;
            p[j].acc -= da * p[i].mass;
            p[j].jerk -= dj * p[i].mass;

            // Collision estimation
            // Metric 1) tau = |r|^2 / |a(j) - a(i)|
            // Metric 2) tau = |r|^4 / |v|^4
            mij = p[i].mass + p[j].mass;
            tau_est_q1 = r2 / (lengthsq(da) * mij * mij);
            tau_est_q2 = (r2*r2) / (v2*v2);

            if (tau_est_q1 < tau_q)
                tau_q = tau_est_q1;
            if (tau_est_q2 < tau_q)
                tau_q = tau_est_q2;
        }
    }
DeadMG
As there is no constructor and no members, that vector is a POD and as such, there shouldn't be any initialization code right? One would hope that the compiler would be smart enough to 'register' those values. But you never know...
Michael Dorgan
yes, I'd hope the compiler would perform this. But, better safe than sorry, in my experience. It's generally good practice to re-use reusable variables anyway - if he wrote another loop that involved, say, strings, it's a good habit to get into.
DeadMG
It's 2010. There's absolutely no need to worry about reusing variables. In fact, if you're using a modern compiler, and reuse a variable, it may even internally represent the reuse as a new variable. There's also no need to do something that makes code less readable and provides no performance benefit.
+2  A: 

My first advice would be to look at the molecular dynamics literature; people in this field have considered a lot of optimizations for particle systems. Have a look at GROMACS for example.

With many particles, what's killing you is of course the double for loop. I don't know how accurately you need to compute the time evolution of your system of particles but if you don't need a very accurate calculation you could simply ignore the interactions between particles that are too far apart (you have to set a cut-off distance). A very efficient way to do this is the use of neighbour lists with buffer regions to update those lists only when needed.

Adrien
A: 

Consider also pulling your multiplication by Constants::G out of the loop. If you can change the semantic meaning of the vectors stored so that they effectively store the actual value/G, you can do the gravitational constant multiplication as needed.

Anything that you can do to trim the size of the Particle structure will also help you to improve cache locality. You don't seem to be using the old* members here. If they can be removed that will potentially make a significant difference.

Consider splitting your particle struct into a pair of structs. Your first loop through the data to reset all of the acc and jerk values could be an efficient memset if you did this. You would then essentially have two arrays (or vectors) where particle 'n' is stored at index 'n' of each of the arrays.
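
A possible shape for that split (hypothetical struct names; valid only if Vector stays plain-old-data, so memset-to-zero yields 0.0 for IEEE 754 doubles):

#include <cstring>
#include <vector>

struct ParticleState { Vector pos, vel; real mass; };  // read-only in the force loop
struct ParticleDeriv { Vector acc, jerk; };            // reset every step

// The whole "zero acc and jerk" loop collapses into one call.
void resetDerivatives(std::vector<ParticleDeriv> &deriv)
{
    std::memset(&deriv[0], 0, deriv.size() * sizeof(ParticleDeriv));
}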

StarLight
A: 

You can replace any occurrence of:

a = b/c
d = e/f

with

icf = 1/(c*f)
a = b*f*icf
d = e*c*icf

if you know that icf isn't going to cause anything to go out of range and if your hardware can perform 3 multiplications faster than a division. It's probably not worth batching more divisions together unless you have really old hardware with really slow division.
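
Not from the original answer: applied to the question's inner loop, once pow(r2, -1.5) has been rewritten as 1/(r2*sqrt(r2)) as suggested in another answer, the two divides per pair can share a single reciprocal:

// Inside the inner loop: one division per pair instead of two.
real rs   = r2 * sqrt(r2);                              // |r|^3
real icf  = 1 / (rs * r2);                              // single shared reciprocal
real r3i  = Constants::G * r2 * icf;                    // == G / |r|^3
Vector da = r * r3i;
Vector dj = (v - r * (3 * dot(r, v) * rs * icf)) * r3i; // rs*icf == 1/r2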

You'll get away with fewer time steps if you use other integration schemes (eg. Runge-Kutta) but I suspect you already know that.

+2  A: 

All good stuff above. I've been doing similar things with a 2nd order (Leapfrog) integrator. The next two things I did after considering many of the improvements suggested above were to start using SSE intrinsics to take advantage of vectorization, and to parallelize the code using a novel algorithm which avoids race conditions and takes advantage of cache locality.

SSE example:

http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native/LeapfrogNativeIntegratorImpl.cpp

Novel cache algorithm, explanation and example code:

http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions/

http://bitbucket.org/ademiller/nbody/src/tip/NBody.DomainModel.Native.Ppl/LeapfrogNativeParallelRecursiveIntegratorImpl.cpp

You might also find the following deck I gave at Seattle Code Camp interesting:

http://www.ademiller.com/blogs/tech/2010/04/seattle-code-camp/

Your fourth order integrator is more complex and would be harder to parallelize with limited gains on a two core system, but I would definitely suggest checking out SSE; I got some reasonable performance improvements here.

Ade Miller
Thanks a lot! Those links were extremely useful. I already have another version that's parallelized where I looped n^2 (instead of (n^2 - n) / 2) times so that I could accumulate the ith particle using force of the jth particle. That version had no race conditions and no need for locking because of how I wrote it.
Sagekilla
I just took a look at your NBody code (Are you also reading Art of Computational Science?) and I have one small recommendation: Try writing your integrators as a predict and correct step, and pull out your force calculation. Then you can write it as a PEC scheme (Predict, Evaluate, Correct).
Sagekilla
That's how mine are written too. The novel algorithm gets you back to (n^2 - n) / 2 for parallel code and also lets you fit blocks of the problem into L1 cache, reducing cache misses. This made a significant difference for me - nearly 2x. At some point cache misses become by far your biggest overhead.
Ade Miller
Yes, I've read chunks of the Art of Comp Sci. I should really get back to it. Currently everything uses a global timestep. I should play around with that some more and move to individual timesteps. I sort of got sidetracked by the whole CUDA thing, which is fast but hard to code for. I have a Hermite integrator in C#, maybe I'll play around with that. Thanks!
Ade Miller