OK, I've been sitting in front of profiler results over the last three days, generated by running a pretty wide range of test cases through an automation suite. The idea is to see if there are any good optimisations that can generally improve performance. I'll qualify 'good' in this context as follows:

  • Has the potential for performance improvements that are both very significant and observable at the end-user level, e.g. > 100% improvement in an underperforming area.

  • Has the potential for core space (i.e. memory) usage reduction, e.g. > 50% reduction in a data-heavy area.

  • Is easy to implement, with minimal obfuscation of the code and minimal side effects, i.e. the benefits of implementing the optimisation greatly outweigh the costs.

The application is a 3D mapping and modelling package with plenty of graphics in the interface and geometric processing at the back end. I've already done a lot of work on ensuring optimal algorithm selection for most processing, and at this stage I'm looking for generally applicable, easy ways to get that extra bit of punch when dealing with large and complex data sets. So far, I've come up with the following:

  • When searching, keep a buffer of the most recently found items and check that first. A heck of a lot of processing with repetitive searches seems to search around the same area. From answers to date this appears to be a specific form of memoization (a minimal sketch is given after this list).

  • When sorting, check the data isn't already in sort order (specifically where qsort has been used)

  • Keep the GUI and processing in separate threads (fails the 'good' criterion of being easy to implement, but IMO still worthwhile).

  • Where heavily used member functions have local variables with significant construction/destruction costs, make those variables private class members instead. This is notably true of dynamic arrays and strings, especially MFC CArrays and CStrings.

  • When using dynamic arrays, set the initial size to slightly exceed typical usage, and have an exponential growth strategy.

  • When dealing with very large data sets to be stored in arrays, size up the data first to avoid any reallocs.

  • Avoid function returns that create temporary object copies on the stack; use reference parameters instead, e.g.

    CString MyFunc(double x, double y)

is less efficient than

void MyFunc(double x, double y, CString &Result)

Actually, avoid CStrings and most of MFC in any performance-critical area of the code. (Edit: this may be largely negated by RVO, although not for CStrings in my app.)
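
To illustrate the search buffer, here is a minimal sketch of the sort of thing I mean (the Key/Value types, the cache size and SlowSearch are placeholders, not lifted from my actual code):

    #include <cstddef>
    #include <deque>
    #include <utility>

    // Most-recently-found cache wrapped around an expensive search.
    // SlowSearch is any callable taking a Key and returning a Value.
    template <typename Key, typename Value, typename SlowSearch>
    class SearchBuffer
    {
    public:
        SearchBuffer(SlowSearch search, std::size_t capacity = 8)
            : m_search(search), m_capacity(capacity) {}

        Value Find(const Key &key)
        {
            // Check the recently found items first; repeated searches
            // around the same area are answered from here.
            typename std::deque<std::pair<Key, Value> >::iterator it;
            for (it = m_recent.begin(); it != m_recent.end(); ++it)
            {
                if (it->first == key)
                    return it->second;
            }

            // Fall back to the full search and remember the result.
            Value result = m_search(key);
            m_recent.push_front(std::make_pair(key, result));
            if (m_recent.size() > m_capacity)
                m_recent.pop_back();
            return result;
        }

    private:
        SlowSearch  m_search;
        std::size_t m_capacity;
        std::deque<std::pair<Key, Value> > m_recent;
    };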

These are items that seem to work well in my context, but are there any obvious ones that I've left out, or are there any other good resources out there on optimisation?

Edit: Based on many of the comments provided, some further explanation is clearly required. While I fully realise that suggesting specific optimisations for a specific piece of code requires sight of that code, the past couple of days spent analysing profiler output have shown up certain patterns in terms of optimisation candidates. I'm also aware of my own ignorance in respect of what others are doing to good effect in the field, and see value (to me at least) in having an enumerated list of such techniques, regardless of whether they apply to my situation. This question is not about the dangers of optimization, but for any beginners out there I'd recommend first establishing a strong requirement for the need to optimize before considering it at all. My own preference is to carry out most optimization at the design stage based on performance requirements, but I'm also a strong advocate of profiling to verify that design assumptions have been met in implementation. I would ask people to please limit their answers to their own experience of successful optimizations rather than their beliefs as to whether we should consider optimization in the first instance.

FWIW, the difference between having the compiler optimise the code or not comes out at 12% across my automation suite, which is borderline observable at end user level.

Second edit: Some related posts I found very beneficial, particularly Mike Dunlavey's comments on becoming too dependent on profiler output.

Performance optimization strategies of last resort?

What can I use to profile C++ code in Linux?

A: 

If you run the code under a sampling profiler and find that there are any compute-intensive routines taking a significant chunk of time then there are numerous micro-optimisations that may be considered, up to and including the use of SIMD if you are able to limit your support to specific CPU(s).
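
As a minimal sketch of the kind of SIMD micro-optimisation meant here (assuming SSE support and arrays whose length is a multiple of four; the function is purely illustrative):

    #include <xmmintrin.h>  // SSE intrinsics

    // Add two float arrays four elements at a time. Assumes n is a
    // multiple of 4; unaligned loads/stores are used for safety.
    void AddArrays(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4)
        {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
    }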

Paul R
+3  A: 

I think your requirements are pretty much mutually exclusive unless there's an obvious flaw of some kind (which is all profiling is really good for finding).

Things that really change performance require a lot of effort, and your basic data structures are the most important thing: reducing memory fragmentation, aligned memory management, SIMD, data structures that are as small as possible and allocated in as few blocks as possible, multithreading, reducing code size from templates, and redeclaring parameters as local variables so the optimizer knows they are safe to optimize. None of those can be tacked on at the end without a lot of cost.
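
A minimal sketch of the parameter-to-local-variable point (hypothetical function; the copy tells the optimizer the scale factor cannot be changed through the data pointer):

    // Without the local copy, the compiler may have to reload *factor on
    // every iteration in case data[i] aliases it.
    void Scale(float *data, int n, const float *factor)
    {
        const float f = *factor;    // local copy: no aliasing to worry about
        for (int i = 0; i < n; ++i)
            data[i] *= f;
    }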

Many times you can't even easily measure the things that really affect performance because they only become costly as the program runs or as its code size grows.

Charles Eli Cheese
wrt your final paragraph @Charles, I don't see why one can't instrument code to measure performance as the program runs or as you vary problem sizes. I do it all the time.
High Performance Mark
The problem is you can't easily tell what effect you have when it comes to memory fragmentation and cache issues. It may seem one tree implementation is much better, but a much simpler linear octree in an array may turn out to be many times faster than a BSP in actual use once the program has run a while.
Charles Eli Cheese
You're right of course that the requirements regularly conflict, but while they may seem mutually exclusive they are often not, as was the case with search buffering, which I have just found out from another answer is a specific case of memoization (http://en.wikipedia.org/wiki/Memoization). I also agree that the best optimisations will happen at the design stage, but I also find profiling valuable, if only to prove design assumptions. I have to look up some more on SIMD and sample applications in C++. Any good references?
Shane MacLaughlin
With regard to your last paragraph, the profiling has been carried out in conjunction with an automated test suite that covers very many typical use cases. (My normal five-hour regression tests took four days to run!)
Shane MacLaughlin
Your compiler docs and intel.com are the only really helpful things. If you use SIMD then you may as well make all memory allocation aligned as well, whether by overriding memory allocation or by compiler alignment declarations. If you don't, you have to do this stupid shuffling nonsense that is not only a big pain but removes a lot of SIMD's benefit. Aligned memory, in general, speeds up execution of arbitrary code by about 30% anyway.
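
A minimal sketch of the aligned-allocation side of this (assuming an x86 target where the SSE intrinsic headers supply _mm_malloc and _mm_free):

    #include <cstddef>
    #include <xmmintrin.h>

    // 16-byte-aligned buffer so aligned SSE loads/stores can be used directly.
    float *AllocateAligned(std::size_t count)
    {
        return static_cast<float *>(_mm_malloc(count * sizeof(float), 16));
    }

    void ReleaseAligned(float *p)
    {
        _mm_free(p);    // memory from _mm_malloc must be freed with _mm_free
    }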
Charles Eli Cheese
+9  A: 

please don't throw up any of the old 'optimisation is the root of all evil' stuff, it's totally irrelevant to this question

Yes, and then you have things like:

When sorting, check the data isn't already in sort order

makes me wonder if you're using an efficient algorithm or not, which is the basic premise that "premature optimization is the root of all evil" states.

and will be downvoted.

Really? That tone is no good. IMO. YMMV.

Similarly, I'm not interested in the joys of letting the compiler optimise it for me, or throwing inlines all over the place. FWIW, the difference between having the compiler optimise the code or not comes out at 12% across my automation suite, which is borderline observable at end user level.

Coupled with any optimization you hand-instrument, you'd still want the compiler optimizations on.

Apart from that, since you provide no particular insights on where your bottlenecks are, it is difficult if not impossible to provide any pointers. I'd hazard a guess that you at the least:

  • Play around with custom allocators
  • Explore the possibility of using vector instructions if your target machine is a vector one

Edit: Since you say you were not aware of RVO: try reading up on move semantics and particularly this library: move library from Adobe. I guess Boost will have something similar.

Edit#2: Two more things:

  • Fixed point arithmetic may be a good thing to look up
  • Use look-up tables as much as you can (a sketch follows below)
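
A minimal sketch of the look-up table idea (table size and the use of sine are illustrative only; it trades memory and a little accuracy for speed):

    #include <cmath>

    const int TABLE_SIZE = 4096;            // pick to suit the accuracy you need
    static double g_sinTable[TABLE_SIZE];

    void InitSinTable()
    {
        for (int i = 0; i < TABLE_SIZE; ++i)
            g_sinTable[i] = std::sin(2.0 * 3.14159265358979 * i / TABLE_SIZE);
    }

    double FastSin(double radians)
    {
        double t = radians / (2.0 * 3.14159265358979);
        t -= std::floor(t);                                   // wrap into [0, 1)
        int index = static_cast<int>(t * TABLE_SIZE) % TABLE_SIZE;
        return g_sinTable[index];                             // nearest entry, no interpolation
    }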
dirkgently
+1 "wonder if you're using an efficient algorithm", "it is difficult if not impossible to provide any pointers": Hear hear.
Martin B
@Martin B: A bit lost on the sarcasm?
dirkgently
The idea behind the question was to tease out optimisation techniques that people are using and have found beneficial. The 'optimisation is the root of all evil' argument has already been done to death on this site. Yes, I use compiler optimization as well, but it doesn't relate to the question. You are right, of course, that it is difficult to provide general solutions that will work in many scenarios, which is why I provided explicit examples and described the context.
Shane MacLaughlin
Apparently yes -- do you mean the statements I quoted were meant ironically...?
Martin B
@dirkgently - quicksort is a widely used algorithm and generally considered efficient, however when provided with pre-sorted data it becomes massively inefficient. See http://profesores.elo.utfsm.cl/~agv/elo320/PLDS210/qsort2.html At the same time qsort is the default sort provided in the standard C library, and works well when pathological cases, such as pre-sorted data, are avoided.
Shane MacLaughlin
@Shane MacLaughlin: I'd guessed as much -- `quicksort` is what you're using. Please note there are a few partitioning techniques to speed it up. Have you tried any of those? Maybe, if required, roll your own iterative `qsort` if the standard one is slow (it doesn't allow such efficient partitioning techniques).
dirkgently
@Shane MacLaughlin: You might be the one asking this question, but there'll be plenty more people reading the answers. So it always helps to add a word of caution before we lead the unsuspecting reader to try out stuff that really, really should be a last resort.
dirkgently
@Martin B: Yes, or so I thought.
dirkgently
@Shane: I use dozens, if not hundreds, of "optimization techniques" that I find beneficial. The problem is I have absolutely no clue which ones of them apply to *you*, because you didn't tell us anything about your code. Psychic debugging is hard, but psychic optimization is just plain impossible. Almost every optimization becomes a pessimisation in some cases. It depends 120% on the context in which it is applied, and you've given us no context.
jalf
@dirkgently and @jalf: edited the tone of the question to make it less inflammatory, warn off newbies, and better explain what I'm after. As an aside, do you ever wonder how many of those who roll out the Knuth/Hoare quote have actually read Knuth or Hoare?
Shane MacLaughlin
+3  A: 

Hi

I fear large and complex data structures. I like large and simple data structures; in particular arrays. When I have large and simple data structures I can try to do clever things with memory access to really optimise my use of the cache; in particular memory-tiling. Not sure if this is any use to you, but in general, given your set of requirements and existing understanding of your code, I'd be looking for ways to optimise the getting of data from RAM to CPU.

And, I would, of course, be parallelising the hell out of it all. Not possible unless you have a multi-core machine. Oh, memo just in, we've all got those these days!

Good luck and do let us know how you get on. I read a lot of crap on SO about what should be the best way to optimise this bit of code or that bit, but very little hard evidence that anyone ever measures anything as you seem to have done.

Heck, I like your question so much I'm upvoting it.

Regards

Mark

High Performance Mark
I don't really get your last paragraph. He says he's measured his code, *and then asks for optimization tips for code in general*, thus making absolutely no *use* of his measurements. You criticize optimization advice without hard evidence to back it up, but it is *impossible* to answer anything to this question with hard evidence, because we haven't seen a line of code. No matter what we suggest, we have no way of determining whether it'll actually help in his case.
jalf
++ For your comment about fearing large and complex data structures. In my experience, most problems are caused by designers being data-happy - glorying in a lovely class hierarchy, chock full of notification-style maintenance of redundant data.
Mike Dunlavey
+5  A: 

CString MyFunc(double x, double y)

is less efficient than

void MyFunc(double x, double y, CString &Result)

If MyFunc is written cleanly they should be about the same; the compiler should be able to take advantage of NRVO. It sounds like you've profiled and found that it's not - I'm just saying it may better fit your criteria, e.g. "minimal obfuscation to the code", to rearrange the function itself a little to allow NRVO to occur.
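
For instance, a sketch of an NRVO-friendly arrangement - a single named local returned on every path (hypothetical function, not taken from the question's code, and assuming an MFC build for CString):

    #include <afx.h>    // MFC CString

    // One named local, returned on every path, so the compiler can
    // construct "result" directly in the caller's storage.
    CString BuildLabel(double x, double y)
    {
        CString result;
        result.Format(_T("%.3f, %.3f"), x, y);
        return result;
    }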

A few more things to try:

  1. memoization (similar to your caching of repeated searches, but more focussed on tree/graph resolution, if you have any).
  2. Using floats instead of doubles, if you don't need the extra precision (or maybe even ints if you can).
  3. Use types to mark assumptions (you mentioned sorted arrays - another common one is lowercased strings). Create a derived or wrapper type (e.g. Sorted<T>) that makes such assumptions explicit. That way if you have a method that takes a Sorted<vector<T> >, for example, and you give it a sorted vector, it passes straight through - but if you give it a vector<T>, it will have to construct a Sorted<vector<T> >, at which point it will sort it. You can manually assert that it is sorted using an alternative constructor, but this makes it much easier to carry your assumptions around and maybe catch places you might otherwise have missed. (A sketch follows after this list.)
  4. Don't give up on inlines etc. Make sure you're fully aware of when they should help and when they should hinder. In some cases they really can make a difference - but probably not if you just deal with them arbitrarily.
  5. You might benefit from flyweight, or pooled object allocators.
  6. When threading, try to minimise any interactions so you can reduce the amount of code that requires locking. Often taking copies of even fairly expensive objects can be less of an overhead than a mutex. Obviously, take advantage of atomic types where you can.
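
A minimal sketch of the Sorted<T> wrapper from point 3 (the names are illustrative only):

    #include <algorithm>
    #include <vector>

    struct assume_sorted_t {};   // tag: caller guarantees the data is already in order

    // Carries the "is sorted" assumption in the type system. Constructing
    // from a plain container sorts it; the tagged constructor skips the sort.
    template <typename Container>
    class Sorted
    {
    public:
        explicit Sorted(const Container &c) : m_data(c)
        {
            std::sort(m_data.begin(), m_data.end());
        }

        Sorted(const Container &c, assume_sorted_t) : m_data(c) {}

        const Container &get() const { return m_data; }

    private:
        Container m_data;
    };

    // A function that needs sorted input simply takes the wrapper:
    // void Search(const Sorted<std::vector<int> > &haystack, int needle);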
Phil Nash
Thanks Phil, just the types of things I was looking for. Just read the Wikipedia article on memoization http://en.wikipedia.org/wiki/Memoization, any other good references?
Shane MacLaughlin
It's more commonly associated with functional programming, so you'll probably find more references in the FP literature. I don't have any ready refs to hand, I'm afraid. The wikipedia article should be enough to give you the idea, though.
Phil Nash
+3  A: 

Start by optimizing your own time. Don't bother trying to list and/or apply optimizations blindly. Don't waste your time converting return values to reference parameters just because you don't trust the compiler to do NRVO.

Don't waste your time manually marking functions as inline. Don't waste your time trying to collect some kind of universal "Dictionary of Optimizations".

97% of your code just doesn't matter performance-wise. If you try to apply optimizations regardless of what they do and where they're useful, you're going to waste 97% of your time. All that time could have been spent optimizing the 3% of the code that actually mattered. (Incidentally, this is what Knuth actually meant with the "root of all evil" quote. Not that optimizations shouldn't be performed, but that unless you have a specific piece of code in mind which you already know is a hotspot, your optimizations will be 1) premature, and 2) ineffective)

So first optimization advice: Close this question and instead ask for help in optimizing the specific code that matters in your app. You're not going to learn anything useful about optimizing the 3% of your app that matter by asking for general optimization tricks.

Second optimization advice (assuming that you're looking at a specific hotspot right now and that you've done all you can in terms of choosing the right algorithm and parallelizing and other high-level optimizations): Look at the assembly output from your compiler.

Third optimization advice: Understand the system you're running on. Structure your code to exploit spatial and temporal locality, to minimize cache misses. Simply switching traversal of a 2D array from column-major to row-major order can easily double performance.

Keep in mind that the compiler and CPU will both reorder instructions to improve throughput, but branches limit their ability to do this, so try to structure your code to get reasonably large basic blocks with no jumps into or out of them. If you're running on a CPU that supports SIMD instructions, consider whether they can be used efficiently.

If you have to really dive into instruction-level optimizations, make sure you have a grasp on the latencies of the instructions used. For floating-point heavy code, keep in mind that FP instructions in general will not be automatically reordered by the compiler or CPU. Because FP operations have fairly long latencies, dependency chains can be a real performance killer, and breaking those up manually can speed your code up significantly. Similarly, avoid memory dependencies. Code that first writes, then reads an address is going to be slow: if one or both memory accesses can't be optimized away, then you have to wait for the write to complete (which could otherwise happen in the background without stalling the CPU) before starting the read. Place all frequently used data in temporaries, to avoid aliasing.
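
A minimal sketch of the traversal-order point (same arithmetic, but the second version walks memory contiguously):

    const int N = 1024;
    static double grid[N][N];

    double SumColumnMajor()     // strides N doubles per step: cache-hostile
    {
        double sum = 0.0;
        for (int col = 0; col < N; ++col)
            for (int row = 0; row < N; ++row)
                sum += grid[row][col];
        return sum;
    }

    double SumRowMajor()        // consecutive addresses: cache-friendly
    {
        double sum = 0.0;
        for (int row = 0; row < N; ++row)
            for (int col = 0; col < N; ++col)
                sum += grid[row][col];
        return sum;
    }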

As for optimizing "large complex datasets" as you asked? I have absolutely no clue. The thing about complex datasets is that they have very little in common. There is no general way to optimize them.

A final suggestion: It sounds like you're not really writing C++. You're talking about manually implementing dynamic arrays, you're talking about reallocs and MFC classes. It sounds like you're writing "C with classes". Use the standard library. std::vector already exists. So does std::string. The C++ standard library has the nice property that it's extremely efficient. The same can't be said for MFC classes.

jalf
With respect, I don't believe for a moment that there are no widely applicable techniques for optimization; surely this is the basis of studying algorithms and complexity theory. Secondly, in terms of unsupportable generalisations, what puts you in a position to comment on 97% of anyone's code that you haven't seen? And lastly, are you really knocking "C with classes" while at the same time suggesting I revert to assembler? Sure, I could refactor endless amounts of legacy code from MFC to standard C++, but I'd need slightly more concrete proof that I'd see a return on that investment.
Shane MacLaughlin
+1  A: 

+1 Good question.

rant I think you were right to request people not to bleat about premature optimisation - all too often this old chestnut is wheeled out as an excuse for lazy design or sloppy implementation. There is such a thing as gratuitous under-optimisation, often caused by poor algorithm selection. end rant

I don't know where this 97% thing comes from. I was taught the 80/20 rule - 20% of the code runs 80% of the time, which interestingly also seems to apply to other things besides software. Anyway...

First port of call is always the algorithm - ensure you are using an efficient algorithm. An unoptimised good algorithm will almost always beat a highly optimised bad algorithm. I suspect you already know this.

A few general optimisations I've found can be useful:

  1. Linear searches are often responsible for quite a lot of overhead. Often these can be replaced by binary searching a sorted collection.

  2. Care is needed when sorting. Although quicksort is a great general purpose sorting algorithm, sometimes bubble sort or other kinds of sorting are faster when your set is already sorted or partially sorted.

  3. Inserting into a sorted collection - you could use a binary search to find the right place instead of the naive (although often good enough) implementation of "stick it on the end, then sort" (see the sketch after this list).

  4. Separating out your keys from your values should help make searching keys faster by utilising the cache more efficiently.

  5. Use a double buffer and swap when two threads are interacting as supplier/consumer to minimise the time the buffer needs to be locked. It's often faster for threads to work on separate copies of data and then fix up later rather than have them trip over each other's locks.

  6. Don't rely on the compiler to optimise away constructors and destructors inside tight loops - either move objects outside the loop and reuse or confirm that the compiler has optimised it as you would wish.
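
A minimal sketch of point 3, using std::lower_bound to insert into an already-sorted vector instead of re-sorting:

    #include <algorithm>
    #include <vector>

    // O(log n) to find the position plus the element shuffle,
    // instead of "push_back then re-sort the whole thing".
    void InsertSorted(std::vector<int> &v, int value)
    {
        std::vector<int>::iterator pos = std::lower_bound(v.begin(), v.end(), value);
        v.insert(pos, value);
    }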

markh44
+2  A: 

I sympathize with your position, especially

I've been sitting in front of profiler results over the last three days

I assume you've done a good job of coding the app intelligently. Now, here's what I suggest:

  • do not look for "any good optimisations that can generally improve performance" that you just try by guesswork. These will be obvious when you know what's taking time.

  • do have a better way of finding out what's taking time than staring at profiler output.
    This is how I do it.

My experience says you have lots more room for speedup. This is my canonical example.

... and good luck. Let us know how it works out.

Mike Dunlavey
Great linked posts. wrt your first point, I'm not looking at trying out optimizations by guesswork; I'm more interested in seeing what others have done, and in what contexts, to get some ideas for future optimizations. Your example was very close to my experience, particularly in terms of diminishing returns, i.e. 82 secs -> 72 secs -> 76 secs -> 24 secs -> 18 secs -> 16 secs -> stop. Note the mis-optimisation midway; I find that happens sometimes. To get from 16 secs to, say, 3-4 secs would require a major rethink/redesign, but a factor of 5 is still good for now.
Shane MacLaughlin
@Shane: It sounds like you're doing the right thing. What I meant by "guesswork" is a theme others have also mentioned: many people just try things blindly and wonder why they don't help much. I like this whole question you started because we're hearing some real expertise bandied about.
Mike Dunlavey
+3  A: 

I'm surprised nobody seems to have responded to one point. In your question, you mention using qsort(). In quite a few cases, you can get quite a substantial speed improvement by using std::sort() instead of qsort(). The size of improvement generally depends on how complex your comparison function is, but I've found 2:1 improvement fairly typical, and 5:1 or even 10:1 is sometimes possible.
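
A minimal sketch of the swap being suggested (the std::sort comparison can be inlined, whereas qsort makes an indirect call for every comparison):

    #include <algorithm>
    #include <cstddef>
    #include <cstdlib>

    // qsort: every comparison goes through a function pointer.
    static int CompareDoubles(const void *a, const void *b)
    {
        const double lhs = *static_cast<const double *>(a);
        const double rhs = *static_cast<const double *>(b);
        return (lhs < rhs) ? -1 : (lhs > rhs) ? 1 : 0;
    }

    void SortWithQsort(double *data, std::size_t n)
    {
        std::qsort(data, n, sizeof(double), CompareDoubles);
    }

    // std::sort: operator< is known at compile time and can be inlined.
    void SortWithStdSort(double *data, std::size_t n)
    {
        std::sort(data, data + n);
    }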

Jerry Coffin