views: 1370
answers: 21

+28  Q: 

Speed up C++ code

Hi,

I am writing a number-crunching C++ application where the bottleneck is a function that has to calculate, for doubles:

 template<class T> inline T sqr(const T& x){return x*x;}

and another one that calculates

Base   dist2(const Point& p) const
       { return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z); }

These operations take 80% of the computation time. I wonder if you can suggest approaches to make them faster, even at the cost of some loss of accuracy.

Thanks

+13  A: 

First, make sure dist2 can be inlined (it's not clear from your post whether or not this is the case), having it defined in a header file if necessary (generally you'll need to do this - but if your compiler generates code at link time, then that's not necessarily the case).

Assuming x86 architecture, be sure to allow your compiler to generate code using SSE2 instructions (an example of a SIMD instruction set) if they are available on the target architecture. To give the compiler the best opportunity to optimize these, you can try to batch your sqr operations together (SSE2 instructions can do up to 4 float or 2 double operations at a time, depending on the instruction... but of course they can only do this if you have the inputs to more than one operation ready). I wouldn't be too optimistic about the compiler's ability to figure out that it can batch them... but you can at least set up your code so that it would be possible in theory.

If you're still not satisfied with the speed and you don't trust that your compiler is doing its best, you should look into using compiler intrinsics, which will allow you to write potentially parallel instructions explicitly... or alternatively, you can go right ahead and write architecture-specific assembly code to take advantage of SSE2 or whichever instructions are most appropriate on your architecture. (Warning: if you hand-code the assembly, either take extra care that it still gets inlined, or make it into a large batch operation.)
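For illustration, a minimal sketch of what an intrinsics version might look like. Everything about it is an assumption: x86 with SSE2, double precision, and a free function rather than your member function:

    #include <emmintrin.h>  // SSE2 intrinsics

    inline double dist2_sse2(const Point& a, const Point& b)
    {
        __m128d axy = _mm_set_pd(a.y, a.x);   // pack {a.x, a.y}
        __m128d bxy = _mm_set_pd(b.y, b.x);   // pack {b.x, b.y}
        __m128d dxy = _mm_sub_pd(axy, bxy);   // {a.x-b.x, a.y-b.y}
        dxy = _mm_mul_pd(dxy, dxy);           // square both differences in one op

        double xy[2];
        _mm_storeu_pd(xy, dxy);               // spill the two squares
        double dz = a.z - b.z;                // scalar tail for the third coordinate
        return xy[0] + xy[1] + dz * dz;
    }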

To take it even further (and as glowcoder has already mentioned), you could perform these operations on a GPU. For your specific case, bear in mind that GPUs often don't support double-precision floating point... though if it's a good fit for what you're doing, you'll get orders of magnitude better performance this way. Google for GPGPU or whatnot and see what's best for you.

guesser
Well, too bad - you covered almost all the suggestions that I'd give. If a GPU is there to be used, I stress that using it would be the right approach. Libraries such as DirectX and OpenGL have methods to calculate the distance between two points.
Bruno Brant
Wouldn't moving the data to the GPU have a high overhead? This would require moving more of the computation to the GPU to offset that overhead.
josefx
@josefx: I haven't heard anyone raise that concern lately, and I'm not sure how high the overhead may be. But since 80% of the time is spent in two simple operations (so he presumably won't need to do much else with those numbers on the GPU -- though we are missing some important details), you may have a good point: if there's ever a case where that overhead matters, this would likely be one of them.
guesser
+7  A: 

I think optimising these functions themselves might be difficult; you might be better off optimising the code that calls them so as to call them less often, or to do things differently.

Salgar
The function might not be able to be optimized, but utilizing graphical capabilities would speed up the overall execution time. See below.
glowcoder
+2  A: 

Floating point operations are quite often slower than integer ones; maybe you can modify the code to use only integer arithmetic and see if this helps?

EDIT: After the point made by Paul R I reworded my advice not to claim that floating point operations are always slower. Thanks.
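For illustration, a minimal fixed-point sketch of the integer idea (the scale factor of 1000 is a hypothetical choice, echoed in the comments below):

    #include <cstdint>

    typedef std::int64_t fixed;        // 1000 represents 1.000

    // Squaring doubles the scale, so divide once to restore it.
    inline fixed sqr_fixed(fixed v) { return (v * v) / 1000; }

    inline fixed dist2_fixed(fixed dx, fixed dy, fixed dz)
    {
        return sqr_fixed(dx) + sqr_fixed(dy) + sqr_fixed(dz);
    }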

Krystian
Actually, that's a good one, I think. Use fixed point, scaled longs where 1000L = 1.000 for example.
paxdiablo
Why the downvote, bods?
paxdiablo
This is an *old skool* optimisation that is mostly invalid these days. On modern superscalar CPUs floating point is not slower than integer - if anything it may well be faster, since it frees up the integer units for address calculations and general housekeeping. A lot of modern CPUs have 2 FPUs giving two FP instruction issues per clock.
Paul R
@Paul R: Of course everything depends on the architecture, and I believe you can find CPUs for which floating point mul is no slower than integer mul. But for example for Core 2 (which is a fairly common architecture), you can look up in http://www.agner.org/optimize/instruction_tables.pdf that MUL reg32, reg32 has latency 3 and throughput 1, while FMUL reg has latency 5 and throughput 1/2. The difference between SUB and FSUB is even more significant: SUB has latency 1 and throughput 3, while FSUB has latency 3 and throughput 1.
Krystian
@Krystian: yeah, but... MULPS has throughput 1 and... Latency 4 ! In other words, throw in some SIMD and the rules change.
RaphaelSP
@Paul R: The throughput for FP ops is very good these days, yes, but latency is still an issue, so in some cases, FP ops are really a lot slower.
jalf
@RaphaelSP: Yes, of course, SIMD changes everything. But I'm not sure if a compiler would be smart enough to take advantage of SIMD when compiling functions like `inline T sqr(const T& x)`.
Krystian
@jalf: for an operation such as the above example it should be trivial to hide any FP instruction latency - we're only talking about add/subtract/multiply, after all.
Paul R
@Krystian: in general latency can be buried quite easily (except when it gets very large, as in the case of say a divide instruction) - the important factor to consider is issue rate (i.e. throughput). If you can issue 2 FP multiplies per clock and only one integer multiply then FP will almost certainly win by a large margin. (And don't forget that you need integer instructions in your loop for address calculations and loop counters etc).
Paul R
@Paul R: Those instructions have significant latency too (often 3 or 4 cycles). What matters is not what the instructions being executed are, but the *context* in which they're executed. Is the result needed immediately, or are there non-dependent instructions that can be executed during those 3-4 cycles while waiting for the result to be available. We don't know for sure, given the fractions of code we've seen, but it's definitely something to look out for when optimizing.
jalf
And no, in the real world, the latency can often not be hidden easily. That's why you very rarely achieve more than 30-50% utilization of the CPU's FP units.
jalf
@jalf: there are various techniques for hiding latency - some can be handled by the compiler and then there are further techniques which can be applied in the code itself to reduce the pressure on the instruction scheduler. It's not always possible of course, but it depends how important it is to approach the theoretical maximum throughput for the CPU, which can usually be achieved if you are prepared to throw enough effort at it.
Paul R
With FP ops, there is little the compiler can do, as reordering dependent operations may affect rounding errors. But no, for FP-heavy code, it is often impossible to get anywhere near maximum throughput, no matter how much effort you throw at it (and yet that's my point: there is a potential problem at which effort has to be thrown). But for the code we've seen, latency is an issue. Both functions are very inefficient in isolation. They rely on there being unrelated instructions around them which can be used to camouflage the latency.
jalf
+5  A: 

If sqr() is being used only on primitive types, you might try taking the argument by value instead of reference. That would save you an indirection.

Fred Larson
It's inlined, so I don't think that matters.
Seth Johnson
@Seth, keep in mind that inline is a _suggestion_ to the compiler. If you want to be certain, revert back to the evil macros of my beloved C :-)
paxdiablo
@Seth Johnson: You might be right. I suppose an actual indirection wouldn't be necessary if it's inlined. And I can't see any reason the compiler would refuse to inline this function. Still, it might be interesting to see if it makes a difference.
Fred Larson
@paxdiablo: Yeah, make it a macro and then try calling it like `SQR(x++)`. ;v)
Fred Larson
That's only a problem for people that don't know what they're doing, @Fred. If you don't know how to handle macros safely, you can always take up VB :-) [ not you specifically, I meant in the general sense ]. I'd at least examine the executable output to make sure it _is_ being inlined.
paxdiablo
When a function's parameters are passed by value, the optimizer is allowed to make more assumptions and changes than it is with const-ref parameters. So in theory a `sqr()` function that takes its argument by value could be better optimized. I don't know if it will matter, as the function is so small. Never optimize without measurement, so try switching the parameter to a value and see if it's faster.
caspin
or use `__forceinline` (MSVC) or `__attribute__((always_inline))` (gcc) to force the compiler to inline the function
KitsuneYMG
Rather than declaring the `sqr` function as inlined, just use `x * x` in the calculation. This eliminates any controversy about function calls.
Thomas Matthews
A: 

(scratch that!!! sqr != sqrt )

See if the "Fast sqrt" is applicable in your case :

http://en.wikipedia.org/wiki/Fast_inverse_square_root

Max
He's not doing a square root.
Paul R
oops!!! sqr != sqrt!!
Max
Admittedly, `sqr` is an extremely bad name for the function. Using slightly longer and more readable names (like `square`, or even just `pow2`) doesn't slow down the code.
UncleBens
+7  A: 

You don't say whether the calls to dist2 can be parallelised or not. If they can, then you could build a thread pool and split this work up into smaller chunks per thread.
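For illustration, a minimal sketch of the splitting (all of this is assumed: C++11 threads rather than a real pool, points held contiguously in a std::vector, and dist2 results independent of one another and collected as doubles):

    #include <functional>
    #include <thread>
    #include <vector>

    // Compute dist2 against a fixed query point for one chunk of the data.
    void dist2_chunk(const std::vector<Point>& pts, const Point& q,
                     std::vector<double>& out, std::size_t begin, std::size_t end)
    {
        for (std::size_t i = begin; i < end; ++i)
            out[i] = pts[i].dist2(q);
    }

    void dist2_parallel(const std::vector<Point>& pts, const Point& q,
                        std::vector<double>& out, unsigned nthreads)
    {
        std::vector<std::thread> workers;
        const std::size_t chunk = pts.size() / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t b = t * chunk;
            std::size_t e = (t + 1 == nthreads) ? pts.size() : b + chunk;
            workers.emplace_back(dist2_chunk, std::cref(pts), std::cref(q),
                                 std::ref(out), b, e);
        }
        for (std::thread& w : workers)
            w.join();
    }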

What does your profiler tell you is happening inside dist2? Are you actually using 100% CPU all the time, or are you missing the cache and waiting for data to load?

To be honest, we really need more details to give you a definitive answer.

Glen
If the process is pinning the CPU, then adding threads won't help (unless you can use multiple CPUs).
Stephen
@Stephen, obviously. However the OP doesn't give us enough info here. Most boxes these days have more than one CPU (at least those used for number-crunching applications), so it's a reasonable guess.
Glen
@Glen: Yep, I figured you knew this... but no harm in being explicit.
Stephen
yes, also a good point - they can be run in parallel
Werner
+4  A: 

If you have a number of these to do, and you're doing graphics or "graphics-like" tasks (thermal modeling, almost any 3d modeling), you might consider using OpenGL and offloading the tasks to a GPU. This would allow the computations to run in parallel, with highly optimized operational capacity. After all, you would expect something like distance or distancesq to have its own opcode on a GPU.

A researcher at a local university offloaded almost all of his 3d calculations for AI work to the GPU and achieved much faster results.

glowcoder
+3  A: 

If you really need all the dist2 values, then you have to compute them. It's already low level, and I can't imagine speedups apart from distributing the work over multiple cores.

On the other side, if you're searching for closeness, then you can supply your current minimum value to the dist2() function. This way, if sqr(x-p.x) is already larger than your current minimum, you can avoid computing the remaining two squares.

Furthermore, you can avoid the first square by going deeper into the double representation: comparing the exponent field directly against that of your current minimum can save even more cycles.
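A minimal sketch of the early-exit idea (the signature is hypothetical; it assumes Base behaves like double):

    Base dist2_capped(const Point& p, Base current_min) const
    {
        Base dx2 = sqr(x - p.x);
        if (dx2 >= current_min) return dx2;    // already no closer: skip two squares
        Base dxy2 = dx2 + sqr(y - p.y);
        if (dxy2 >= current_min) return dxy2;  // still too far: skip the last square
        return dxy2 + sqr(z - p.z);
    }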

Didier Trosset
+5  A: 

If you can organise your data suitably then you may well be able to use SIMD optimisation here. For an efficient implementation you would probably want to pad your Point struct so that it has 4 elements (i.e. add a fourth dummy element for padding).
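For illustration, a hypothetical padded layout (alignas assumes C++11; floats shown for the 4-element case - with doubles you would pair them per 16-byte register instead):

    struct alignas(16) Point {
        float x, y, z;
        float pad;   // dummy fourth element: one Point fills one 16-byte SIMD register
    };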

Paul R
Making use of SIMD is a good idea, though it's misleading to say that there's a need to pad a struct to 4 elements.
guesser
or do 2/4 points at a time
KitsuneYMG
this seems to be a very good point; I am going to give it a try
Werner
@Werner - If the SIMD instruction set is not a standard feature of your base target architecture, you will still need to tell your compiler to make use of it. Eg. For Intel today, this would generally be SSE2.
guesser
@guesser: the implementation will be a lot more efficient if you can map one Point to one SIMD vector - I don't see what's misleading about that?
Paul R
@kts: yes, it depends on your data organisation, i.e. whether it's AoS or SoA, and whether there is any flexibility to change it to make it more SIMD-friendly.
Paul R
Definitely do 4 points at a time, if possible (i.e. if your algorithm can be done as data parallel). Getting speed-up from vector operations on normal data structures is very troublesome - the code sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z) is basically a dot product, which requires summing across the components. This cannot be done with SSE or SSE2; you need SSE3 for that.
Suma
@PaulR: If I envision coding this in assembly, it seems irrelevant. We can see that he needs to do subtractions such as x-p.x first. Then, for example, SSE2's double mul instruction will do two multiplications (4 inputs.. a*a, b*b.. (NOT 4 muls)) in parallel. It would be better to pack all the doubles in one big array, so that the third coord of one point can be sqr'd in the same op as the first coord of the next, etc. Alignment in general could be important, but that is a different issue.
guesser
@Suma: it's not clear that Werner is using x86, but if so, then yes, you might need SSE3 for the HADDPS instruction. SSE3 has been around for quite a few years now though, so unless he is targeting very old x86 platforms it can probably be assumed that it's available.
Paul R
If you can apply data parallelism, you may reasonably expect to achieve speedup 2x-4x. By using Point as one SIMD data, you will most likely get something like 1.1x-1.5x. See also http://stackoverflow.com/questions/115291/how-much-speed-up-from-converting-3d-maths-to-sse-or-other-simd/115293#115293. Note: the answer was downvoted a lot, but I am quite confident in this particular case you describe here it applies very well.
Suma
@guesser: you're assuming x86, SSE2, double precision and also assembly language, none of which is a given. If it *is* going to be x86 then SSE3 is a reasonable assumption, and using C intrinsics rather than assembly is a good approach. If there is an array of Point structs then it's going to be a lot more efficient to process these if they are (a) 16-byte aligned and (b) can be made 16 bytes in size (i.e. 4 floats ideally, or 2 doubles if the extra precision is required).
Paul R
@Suma: with modern x86 CPUs (Core 2 Duo and later) SSE is a full 128 bit implementation (not a 64 bit kludge as with earlier CPUs) and SIMD performance is typically at least 2x better than in the bad old days as a result. With Core i7 you have multiple SSE execution units and performance is even better yet.
Paul R
@PaulR: Alright, seems you've got some experience with this, so I'll trust you on it. Until I see it myself a few times, I'd personally try some other things without the packing too.. and keep whatever works best.
guesser
+9  A: 

What is Base?

Is it a class with a non-explicit constructor? It's possible that you're creating a fair amount of temporary Base objects. That could be a big CPU hog.

template<class T> inline T sqr(const T& x){return x*x;}
Base   dist2(const Point& p) const {
  return sqr(x-p.x) + sqr(y-p.y) + sqr(z-p.z);
}

If p's member variables are of type Base, you could be calling sqr on Base objects, which will create temporaries for the subtracted coordinates, inside sqr, and then for each addition.

(We can't tell without the class definitions)

You could probably speed it up by forcing the sqr calls to operate on primitives and not using Base until you get to the return type of dist2.
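A minimal sketch of that (it assumes, hypothetically, that Base is convertible to and from double):

    Base dist2(const Point& p) const
    {
        // All arithmetic on primitives; Base is constructed exactly once.
        const double dx = double(x) - double(p.x);
        const double dy = double(y) - double(p.y);
        const double dz = double(z) - double(p.z);
        return Base(dx * dx + dy * dy + dz * dz);
    }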

Other performance improvement opportunities are to:

  • Use non-floating point operations, if you're ok with less precision.
  • Use algorithms which don't need to call dist2 so much, possibly caching or using the transitive property.
  • (this is probably obvious, but) Make sure you're compiling with optimization turned on.
Stephen
yes, you are right, I have to check and debug Base
Werner
excellent - note that some style guides (Google's included) require all single-argument constructors to be marked explicit, partially for this reason ( http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml?showone=Explicit_Constructors#Explicit_Constructors )
Stephen
+2  A: 

From an operation count, I don't see how this can be sped up without delving into hardware optimizations (like SSE), as others have pointed out. An alternative is to use a different norm: the 1-norm, for example, is just the sum of the absolute values of the terms, so no multiplications are necessary. However, this changes the underlying geometry of your space by rearranging the apparent spacing of the objects, though it may not matter for your application.
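A sketch of that alternative (the name dist1 is hypothetical; Base is assumed to be double):

    #include <cmath>

    Base dist1(const Point& p) const
    {
        // 1-norm: sum of absolute differences, no multiplications needed.
        return std::fabs(x - p.x) + std::fabs(y - p.y) + std::fabs(z - p.z);
    }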

rcollyer
+3  A: 

Are you using Visual Studio? If so you may want to look at specifying the floating point model using /fp:fast as a compile switch. Have a look at The fp:fast Mode for Floating-Point Semantics. GCC has a host of -f floating point optimisations you might want to consider, such as -ffast-math (if, as you say, accuracy is not a huge concern).
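For example (illustrative command lines; the file name is hypothetical, the switches are real):

    cl  /O2 /fp:fast /arch:SSE2 crunch.cpp
    g++ -O3 -ffast-math -msse2 crunch.cpp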

TheJuice
If using Visual Studio then one quick and easy optimisation is to switch to a decent compiler, e.g. Intel's ICC.
Paul R
@Paul R: :) but my comment stands equally for ICC: http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/win/compiler_c/fpops/common/fpops_fp_model.htm#fpops_fp_model
TheJuice
A: 

Look at the context. There's nothing you can do to optimize an operation as simple as x*x. Instead you should look at a higher level: where is the function called from? How often? Why? Can you reduce the number of calls? Can you use SIMD instructions to perform the multiplication on multiple elements at a time?

Can you perhaps offload entire parts of the algorithm to the GPU?

Is the function defined so that it can be inlined? (basically, is its definition visible at the call sites)

Is the result needed immediately after the computation? If so, the latency of FP operations might hurt you. Try to arrange your code so dependency chains are broken up or interleaved with unrelated instructions.
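For illustration (a hypothetical caller; a, b, c and d are assumed to be independent Point objects):

    // No data dependencies between these four calls, so their
    // subtract/multiply/add chains can execute interleaved.
    double d0 = a.dist2(p);
    double d1 = b.dist2(p);
    double d2 = c.dist2(p);
    double d3 = d.dist2(p);
    // consume d0..d3 only after all four have been issued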

And of course, examine the generated assembly and see if it's what you expect.

jalf
+3  A: 

I suggest two techniques:

  1. Move the structure members into local variables at the beginning.
  2. Perform like operations together.

These techniques may not make a difference, but they are worth trying. Before making any changes, print the assembly language first. This will give you a baseline for comparison.

Here's the code:

Base   dist2(const Point& p) const
{
    //  Load the data values into locals.
    register Base x1 = p.x;
    register Base y1 = p.y;
    register Base z1 = p.z;

    // Perform the subtractions together
    x1 = x - x1;
    y1 = y - y1;
    z1 = z - z1;

    // Perform the multiplications together
    x1 *= x1;
    y1 *= y1;
    z1 *= z1;

    // Perform the final sum
    x1 += y1;
    x1 += z1;

    // Return the final value
    return x1;
}

The other alternative is to group by dimension. For example, perform all 'X' operations first, then 'Y', followed by 'Z'. This may show the compiler that the pieces are independent and it can delegate them to another core or processor.

If you can't get any more performance out of this function, you should look elsewhere, as other people have suggested. Also read up on Data Driven Design. There are examples where reorganizing the loading of data can speed up performance by over 25%.

Also, you may want to investigate using other processors in the system. For example, the BOINC Project can delegate calculations to a graphics processor.

Hope this helps.

Thomas Matthews
register is useless on modern compilers/platforms, similar for local vs. member. With a decent compiler the generated code should be effectively identical.
peterchen
A: 

Is there a reason you are implementing your own sqr function?

Have you tried the one in libm? It should be highly optimized.

Martin York
+1  A: 

Your best hope is to double-check that every dist2 call is actually needed: maybe the algorithm that calls it can be refactored to be more efficient? If some distances are computed multiple times, maybe they can be cached?

If you're sure all of the calls are necessary, you may be able to squeeze out a last drop of performance by using an architecture-aware compiler. I've had good results using Intel's compiler on x86s, for instance.

+2  A: 

There are a lot of answers mentioning SSE already… but since nobody has mentioned how to use it, I'll throw another in…

Your code has almost everything a vectorizer needs to work, except for two constraints: aliasing and alignment.

  • Aliasing is the problem of two names referring to the same object. For example, my_point.dist2( my_point ) would operate on the same object through two names. This messes with the vectorizer.

    C99 defines the keyword restrict for pointers to specify that the referenced object is referenced uniquely: there will be no other restrict pointer to that object in the current scope. Most decent C++ compilers implement C99 as well, and import this feature somehow.

    • GCC calls it __restrict__. It may be applied to references or this.
    • MSVC calls it __restrict. I'd be surprised if support were any different from GCC.

    (It is not in C++0x, though.)

    #ifdef __GNUC__
    #define restrict __restrict__
    #elif defined _MSC_VER
    #define restrict __restrict
    #endif
     
    Base   dist2(const Point& restrict p) const restrict
    
  • Most SIMD units require alignment to the size of the vector. C++ and C99 leave alignment implementation-defined, but C++0x wins this race by introducing [[align(16)]]. As that's still a bit in the future, you probably want your compiler's semi-portable support, a la restrict:

    #ifdef __GNUC__
    #define align16 __attribute__((aligned (16)))
    #elif defined _MSC_VER
    #define align16 __declspec(align (16))
    #endif
     
    struct Point {
        double align16 xyz[ 3 ]; // separate x,y,z might work; dunno
        …
    };
    

This isn't guaranteed to produce results; both GCC and MSVC implement helpful feedback to tell you what wasn't vectorized and why. Google your compiler's vectorizer to learn more.

Potatoswatter
+1  A: 

Just a few thoughts, however unlikely it is that I will add anything of value after 18 answers :)

If you are spending 80% of the time in these two functions, I can imagine two typical scenarios:

Your algorithm is at least polynomial
As your data seem to be spatial, maybe you can bring the complexity down by introducing spatial indexes?

You are looping over a certain set
If this set comes either from data on disk (sorted?) or from a loop, there might be a possibility to cache, or to use previous computations to calculate sqr faster.

Also regarding the cache, you should define the required precision (and the input range) - maybe some sort of lookup/cache can be used?

Unreason
A: 

The first thing that occurs to me is memoization (on-the-fly caching of function calls), but both sqr and dist2 seem too low level for the overhead associated with memoization to be made up for by the savings it brings. However, at a higher level, you may find that it works well for you.

I think a more detailed analysis of your data is called for. Saying that most of the time in the program is spent executing MOV and JMP instructions may be accurate, but it's not going to help you optimise much; the information is too low level. For example, if you know that integer arguments are good enough for dist2, and the values are between 0 and 9, then a precomputed table would need only 1000 elements - not too big. You can always use code to generate it.

Have you unrolled loops? Broken down matrix operations? Looked for places where you can get by with a table lookup instead of an actual calculation?
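For illustration, a minimal sketch of such a table for sqr (hypothetical constraints: integer deltas in [-9, 9], and C++17 for the constexpr construction):

    #include <array>

    constexpr std::array<int, 19> make_sqr_table()
    {
        std::array<int, 19> t{};
        for (int d = -9; d <= 9; ++d)
            t[d + 9] = d * d;          // t[0] = (-9)^2 ... t[18] = 9^2
        return t;
    }
    constexpr std::array<int, 19> kSqrTable = make_sqr_table();

    inline int sqr_lookup(int d) { return kSqrTable[d + 9]; }  // d must be in [-9, 9]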

Most drastic would be to adopt the techniques described in http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8660&rep=rep1&type=pdf though it is admittedly a hard read, and you should get some help from someone who knows Common Lisp if you don't.

HandyGandy
A: 

I'm curious why you made this a template when you said the computation is done using doubles. Why not write a standard method or function, or just 'x * x'?

If your inputs can be predictably constrained and you really need speed, create an array that contains all the outputs your function can produce. Use the input as the index into the array (a sparse hash). A function evaluation then becomes a comparison (to test for array bounds), an addition, and a memory reference. It won't get a lot faster than that.

Jay
A: 

See the SUBPD, MULPD and DPPD instructions. (DPPD requires SSE4.1.)

Depends on your code, but in some cases a structure-of-arrays layout might be more friendly to vectorization than an array-of-structures layout.
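For illustration, the two layouts side by side (hypothetical declarations):

    #include <vector>

    // Array-of-structures: x, y, z interleaved per point.
    std::vector<Point> points_aos;

    // Structure-of-arrays: each coordinate contiguous, so SUBPD/MULPD can
    // consume two points' worth of one coordinate per instruction.
    struct PointsSoA {
        std::vector<double> x, y, z;
    };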

Gautham Ganapathy