Is there a site that provides a great list of common C++ optimization practices?

What I mean by optimization is modifying the source code to make a program run faster, not just changing compiler settings.

+4  A: 

I don't have a site off the top of my head, but the book "Exceptional C++" by Herb Sutter is superb for C++ development. I highly recommend every C++ programmer read it, as it gives great insight into not only optimization but also smart usage of the language, so that what you write is truly exceptional.

j0rd4n
Sutter spends a great deal of time on micro-optimizations, like not doing unintentional copying of your data structures when you pass them around in code. But you're unlikely to get large gains from these improvements either. Better algorithms (including using the standard library) are the big win.
Brandon Corfman
Ironically enough, the standard library is better in part *because* it uses the optimizations Sutter speaks about.
Tom
Yes, once you have the algorithms right, then you can micro-optimize.
Brandon Corfman
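
As a hedged illustration of the unintentional-copy point above (function and variable names are hypothetical): passing a container by value copies everything in it on each call, while passing by const reference does not.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Taking the vector by value would copy every string in it on each call:
//   std::size_t total_length(std::vector<std::string> v);  // unintentional copy
// Taking it by const reference avoids the copies entirely:
std::size_t total_length(const std::vector<std::string>& v) {
    std::size_t n = 0;
    for (const std::string& s : v)   // const ref per element, too: no copies
        n += s.size();
    return n;
}
```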
+11  A: 

Two ways to write better programs:

Make best use of language

  1. Code Complete by Steve McConnell
  2. Effective C++
  3. Exceptional C++

profile your application

  1. Identify what areas of code are taking how much time
  2. See if you can use better data structures/ algorithms to make things faster

There is not much language-specific optimization one can do - it is limited to using language constructs well (learn from #1). The main benefit comes from #2 above.
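
A minimal hand-timing sketch for step 1 above (the `work` function is a hypothetical stand-in for a suspected hot spot; a real profiler gives the whole-program view):

```cpp
#include <chrono>

// Hypothetical hot spot whose cost we want to measure.
long long work(int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += static_cast<long long>(i) * i;
    return sum;
}

// Wrap any callable in steady_clock timestamps and return elapsed microseconds.
template <typename Fn>
long long time_us(Fn fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Usage: `time_us([]{ work(10000000); })` tells you where one candidate's time goes, which you can then compare against a better data structure or algorithm (step 2).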

Sesh
I have Code Complete, at least the version from 1993, and while it is chock full of great solid programmer advice, I wouldn't say it "Makes Best Use of Language" (C++). It is fairly language neutral.
Brian Neal
Yup you are right. I just meant to say it helps you avoid common pitfalls.
Sesh
Efficient C++ is good book too
pm100
I would have put profiling first. Profiling is always a surprise; "OMG - what is it doing there?". Remember that something is always taking all your CPU time. You should get to the point where the top 10 things in the list feel like they make sense. Make sure you have repeatable / measurable test cases.
pm100
A: 

The best optimization comes from revisiting the design, after profiling the performance-relevant parts/algorithms of the application. This is usually not language-specific.

What I mean is (just as an idea): if you would get a 30% performance improvement by selecting a slightly better algorithm (or collection/container class), the improvement you could expect from a C++-related refactoring would be at most 2%. A design improvement could give you anything above 30%.

If you have a concrete application, the best strategy is to measure and profile it. Profiling usually gives the most immediate idea of which parts are performance-relevant.
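
To make the "slightly better algorithm or container" point concrete, here is a hedged sketch (hypothetical functions): repeated membership tests against a vector are O(n) each, while building an unordered_set makes them O(1) on average, for the same result.

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// O(a.size() * b.size()): a linear search for every element of a.
bool share_element_slow(const std::vector<int>& a, const std::vector<int>& b) {
    for (int x : a)
        if (std::find(b.begin(), b.end(), x) != b.end())
            return true;
    return false;
}

// O(a.size() + b.size()) on average: same answer, better container.
bool share_element_fast(const std::vector<int>& a, const std::vector<int>& b) {
    std::unordered_set<int> lookup(b.begin(), b.end());
    for (int x : a)
        if (lookup.count(x) != 0)
            return true;
    return false;
}
```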

siddhadev
Proper language-level optimizations will give you far more than 2% improvement.
Tom
don't take the 2% or the 30% as absolute values; they were meant as a comparison
siddhadev
+1  A: 

Most optimizations are language-agnostic. Understand your code, and understand the hardware you're running on, and you can do most low-level optimizations.

Understand your problem domain, and suitable algorithms, and you can do any high-level optimizations.

The only C++-specific optimization advice I can think of is "understand what your code means". Understand when C++ copies temporaries around, understand which constructors and destructors are called when.

And prefer functors to function pointers, as the former can be inlined by the compiler. In general, move as much as possible to compile-time rather than runtime. Use templates to do the heavy lifting.

And of course, don't try optimizing until you've profiled and found out that 1) optimization is necessary, and 2) what needs optimizing.

Edit: A comment asked about functors vs function pointers being inlined. Here's the explanation:

Functions are usually compiled in isolation. So what does the compiler know about a function F that takes a function pointer FP as argument? Nothing; it has to look up where F is called from, and maybe there it can find a definite clue as to which function FP points to. If it can determine that, when called from here, FP will ALWAYS point to function G, then yes, it can make an inlined version of F, with G inlined inside it, for this particular call site. But it can't simply inline G without inlining F, because F might be called from other places as well, where it is passed a different function pointer. And even then, it requires some expensive global optimizations to determine that anything can be inlined.

Imagine you pass a functor such as this instead:

struct Ftor {
  void operator()() { ... }
};

so the function F looks like this:

void F(const Ftor& ft) {
  ...
  ft();
  ...
}

now the compiler knows exactly which function gets called: the ft() line in the function body calls Ftor::operator(). And so the call can be inlined easily.

Of course, in practice, you'd typically template it, so the function can be called with any functor type:

template <typename Ftor>
void F(const Ftor& ft) {
  ...
  ft();
  ...
}
jalf
Why would a compiler be unable to inline the invocation of a function pointer, while it'd inline an equivalent invocation of a functor?
Edited my post to answer that one better.
jalf
Well, sure, if you're just passing a "Ftor" object then it's not a problem to inline it. But then the equivalent functorless implementation would NOT take a function pointer - it'd take a reference to the same thing Ftor's constructor does. For a fair comparison, the functor would have to be polymorphic :)
I never said they were equivalent. That's why I said to *prefer* functors. Not "always use functors", because obviously you can't. But use them when you can. I'm not sure what you mean about the functorless version. What would you want it to take, a ref to a function? Doesn't solve the problem.
jalf
Of course when you work with functors, you'd typically make the exact type a template parameter, so it would use static polymorphism rather than the dynamic variant.
jalf
Hm, actually I nearly always call functors polymorphically because I defer the call for later. I haven't written generic algorithms like std::sort (with a custom static-polymorphic comparator), instead I've had Screensaver::setLaunchCondition(function<bool()>).
Besides, if you know the functor type at compile time, you might as well take a function pointer as a template arg: template<void(*func)()> bla(); - perfectly inlineable while not using an actual functor.
No, that's not inlineable, because the type says nothing about which function is called. It only says "a function taking no arguments and returning void" will be called. The compiler can't inline that. That is precisely the problem. The function has to be identifiable through the type to be inlined
jalf
No, you don't get it. Maybe you aren't aware of this C++ feature. Run this one: template<void(*func)()> void bla() { func(); } void test() { cout << "test"; } int main() { bla<test>(); }
When I talk about using a function pointer, I didn't mean as a template parameter. I meant a function like this: void bla(void(*f)()) { ... f(); } And that can not be inlined easily. Your method is another way to allow inlining, at the loss of some flexibility.
jalf
Right. I see you know what you're talking about. I'll leave it at that.
+1  A: 
  1. Profile your code to find what's actually taking the most time
  2. Make sure you're using the right algorithm
  3. If you're doing a lot of number crunching, be friendly with your cache and try to do everything possible on a chunk of data all at once so it doesn't have to be loaded into cache many times.
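
A tiny sketch of point 3 (the transforms are hypothetical): fusing two passes over the same data into one means each element is loaded into cache once and fully processed, instead of streaming the whole array through the cache twice.

```cpp
#include <vector>

// Two separate passes: the array is pulled through the cache twice.
void scale_then_offset(std::vector<float>& data) {
    for (float& x : data) x *= 2.0f;
    for (float& x : data) x += 1.0f;
}

// One fused pass: same result, half the memory traffic.
void scale_and_offset(std::vector<float>& data) {
    for (float& x : data) x = x * 2.0f + 1.0f;
}
```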
Mr Fooz
+5  A: 

Don't forget a few things:
- "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (c) Donald Knuth
- Optimizing algorithms gains more than optimizing code.
- Optimize only the slow parts of existing code, as identified by a profiler or other special tool.

bb
+3  A: 

Agner Fog has done a great job analyzing the output of several compilers regarding C++ constructs. You will find his work here: http://www.agner.org/optimize/.

Intel offers a great document too - the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", which you will find at http://www.intel.com/products/processor/manuals/index.htm. Although it mainly targets IA-32 architectures, it contains general advice that can be applied on most platforms. Obviously, it and Agner Fog's guide do intersect a bit.

As mentioned in other answers, micro-optimization is obviously the last step you want to take to make your program faster, after profiling and algorithm selection.

RaphaelSP
+13  A: 

I will echo what others have said: a better algorithm is going to win in terms of performance gains.

That said, I work in image processing, which as a problem domain can be stickier. For example, many years ago I had a chunk of code that looked like this:

void FlipBuffer(unsigned char *start, unsigned char *end)
{
    unsigned char temp;

    while (start <= end) {
        temp = _bitRev[*start];
        *start++ = _bitRev[*end];
        *end-- = temp;
    }
}

which rotates a 1-bit frame buffer 180 degrees. _bitRev is a 256 byte table of reversed bits. This code is about as tight as you can get it. It ran on an 8MHz 68K laser printer controller and took roughly 2.5 seconds for a legal sized piece of paper. To spare you the details, the customer couldn't bear 2.5 seconds. The solution was an identical algorithm to this. The difference was that

  1. I used a 128K table and operated on words instead of bytes (the 68K is much happier on words)
  2. I used Duff's device to unroll the loop as much as would fit within a short branch
  3. I put in an optimization to skip blank words
  4. I finally rewrote it in assembly to take advantage of the sobgtr instruction (subtract one and branch on greater) and have "free" post increment and pre-decrements in the right places.

So: a 5x speedup, with no algorithm change.

The point is that you also need to understand your problem domain and what the bottlenecks mean. In image processing, the algorithm is still king, but if your loops are doing extra work, multiply that work by several million pixels and that's the price you pay.
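
For illustration, the Duff's-device unrolling of point 2 looks like this in C++ (a generic byte-copy sketch, not the author's printer code):

```cpp
// Duff's device: unroll a copy loop 8x, handling the remainder by
// jumping into the middle of the unrolled body. Requires count > 0.
void copy_bytes(unsigned char* to, const unsigned char* from, int count) {
    int n = (count + 7) / 8;       // number of times through the loop
    switch (count % 8) {           // jump to the remainder's entry point
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The switch dispatches once per call rather than once per iteration, which is the whole point of the unrolling.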

plinth
I wonder whether using C++ templates instead of assembly will achieve the goals you had. Or in other words, is sobgtr "accessible" within c++? If not, how to make it accessible?
Amit Kumar
A: 

++p is usually faster than p++, and --p faster than p--, especially for objects of types with overloaded prefix and postfix increment and decrement operators. The prefix form just increments or decrements something and returns the new value, whereas the postfix form increments or decrements something but has to keep the old value somewhere in order to return it. That is, instead of (replace int with your favorite class here)

for (int i(0); i < x; i++)

always write

for (int i(0); i < x; ++i)
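
A hypothetical class makes the difference visible: the postfix form must construct and return a copy of the old state, while the prefix form does not.

```cpp
struct Counter {
    int value = 0;

    // Prefix: modify in place and return *this. No temporary.
    Counter& operator++() {
        ++value;
        return *this;
    }

    // Postfix: must copy the old state just to return it.
    Counter operator++(int) {
        Counter old(*this);   // the extra copy described above
        ++value;
        return old;
    }
};
```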
dmityugov
But you aren't going to get any significant performance improvements from changing all your post increments to pre increments on the built-in types like int. Bad example. For user defined types, sure. But not ints.
Brian Neal
depends on the compiler, really
dmityugov
Can you mention any real compiler that doesn't optimize this for built-in types when optimizations are enabled?
Mr Fooz
It's a good habit to get into, and it's easier to use the prefix form everywhere than to use it only with user-defined types.
David Thornley
Mr Fooz, comments at SO are limited to 300 characters so no, I can't. Create a question please, and I'll try to answer it
dmityugov
You could rephrase your post to "'++i' is at least as fast as 'i++', and sometimes faster" (see also http://www.parashift.com/c++-faq-lite/operator-overloading.html#faq-13.15). I think you don't deserve to be rated down, because what you say is not wrong per se. Edit: yes, the rate down is wrong.
phresnel
thanks for the suggestion phresnel, "++p is usually faster than p++" in my answer indeed means "++p is at least as fast as p++, and sometimes faster", it just contains fewer characters
dmityugov
+2  A: 

It is common in other engineering disciplines to assign budgets to the components of the system. For example, the engines of a VTOL aircraft are designed to provide a certain amount of lift, so the weight must be within a limit. At a high level, each part of the aircraft is given a portion of the weight budget which it should meet.

The reason this is done top down, rather than waiting until it's too bloated to get off the deck and then weighing each part and filing a bit off of the heaviest bit, is partly due to the cost of changing fabricated components. But a large part of it is that if you create a system where everything is a bit over budget everywhere, you can't just fix it in one place.

The classic software example is the SGI Indy running Irix 5.1, which is partly why graphics-intensive users have Macs and Windows machines now rather than SGI boxes.

"What's most frightening about the 5.1 performance is that nobody knows exactly where it went. If you start asking around, you get plenty of finger-pointing and theories, but few facts. In the May report, I proposed a "5% theory", which states that each little thing we add (Motif, internationalization, drag-and-drop, DSOs, multiple fonts, and so on) costs roughly 5% of the machine. After 15 or 20 of these, most of the performance is gone."

Frequently in discussions of performance, 5% is said to be insignificant, and the advice is to wait until there is a problem and then look for a single bottleneck. For a large system, waiting until you have a problem may just lose you your main business.

Pete Kirkham
> For a large system, waiting until you have a problem may just lose you your main business. Do you have an instance of this? Usually I would think that if the large system is well-designed it would be more amenable to change. OTOH modifying the design for performance right from the start makes the system difficult to change and therefore to optimize.
Amit Kumar
@Amit are SGI doing so well these days?
Pete Kirkham
"5% is said to be insignificant, and the advice is to wait until there is a problem and then look for a single bottleneck." This is the wrong part. There has to be sufficient time allocated for later optimization, particularly when performance is critical to marketability. Better still, performance monitoring is a daily job of the engineers. For example: http://www.dre.vanderbilt.edu/Stats/performance.shtml
Amit Kumar
+1  A: 

You asked for sites/sources containing optimization wisdom.

Some good ones have been suggested.

I might add that they will nearly all say that profiling is the best if not the only way to locate performance problems.

I'm not sure where this folk-wisdom originated or how it was justified, but there is a better way.

ADDED:

It is true that the "wrong algorithm" can kill performance, but that's certainly not the only way.

I do a lot of performance tuning. On large software, what usually kills performance is too much data structure and too many layers of abstraction.

What seem like innocent one-liner method calls to abstract objects tempt you to forget what that call could cost you. Multiply this tendency over several layers of abstraction, and you find things like, for example, spending all your time allocating and collecting things like iterators and collection classes when simple arrays with indexing would have been sufficient (and no less maintainable) but less "proper".

That's the problem with "common wisdom". It often is exactly the opposite of wisdom.
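
One hedged illustration of the layering point (a hypothetical lookup): when keys are small, dense integers, a plain indexed array does the same job as a node-based container without the allocation and pointer-chasing layers.

```cpp
#include <map>
#include <vector>

// "Proper": every lookup walks a node-based tree scattered across the heap.
int lookup(const std::map<int, int>& table, int key) {
    auto it = table.find(key);
    return it == table.end() ? 0 : it->second;
}

// Sufficient, and no less maintainable, when keys are 0..N-1: direct indexing.
int lookup(const std::vector<int>& table, int key) {
    return table[key];
}
```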

Mike Dunlavey
It sounds like you're advocating making code less readable to improve performance. You shouldn't do that unless you know that you need better performance, and that you're changing code where it matters (as found by a profiler, or by your idea of hand-profiling).
David Thornley
@David: Right. Find out where it matters. I've found that if you do this for a while, you tend not to over-design things in the first place, because oddly enough much of the complexity has the implicit purpose of "performance", and you know it will do the opposite.
Mike Dunlavey
Mike Dunlavey
Periodic stack snapshots (the technique you link) is simply another form of profiling. This is the form of profiling provided by the OS X developer tool shark.
Michael Anderson
@Michael Anderson: All forms of profiling are not equivalent. Apple is deprecating shark and replacing it with instruments. I just looked up the sampling doc on instruments. They sample the stack, even when blocked - that's very good. They talk about the call tree, top down or bottom up. I couldn't tell if that includes middle-out, which is important because a mid-level routine may cost a lot, but since it could be in many different branches you might not see its true cost unless it is the focus of the butterfly view. I don't see anything about lines of code, only functions. ...
Mike Dunlavey
@Michael Anderson: ... That's only useful if functions are small. Functions are not always small, regardless of what some think they *should* be. The shark call tree shows percent. I hope instruments call tree does also, and I hope that is the percent of stack samples going through that node. They apologize that the numbers are imprecise, to which I reply that precision of measurement is not a useful goal. They apologize that the frequency of samples per second is not very high, to which I give the same reply. So, bottom line, they're much better than profilers of old, but ...
Mike Dunlavey
@Michael Anderson: ... there are some ways they need to change to get to the same level as Zoom or LTProf. (Considering only the performance sampling part - in other respects, I have no comment.) If they get as good as Zoom or LTProf, which they could easily, then what separates them from the manual technique is still that they are concentrating on measuring, rather than on the insight you get from looking in detail at representative samples, one at a time. Not to mention context. Seeing additional variables in the program state can make all the difference.
Mike Dunlavey
A: 

Here are a couple of catch-all approaches to optimization.

There is no single recipe for optimization problems... the fix is always hand-tuned to the hardware/software/system considerations.


Assuming you have the best algorithm:

  1. compile with "show assembly output" and "highest optimization"
  2. look at the assembly output
  3. identify inefficiencies that ruin compiler optimization or bad caching
  4. Re-code the snippet
  5. if it is still bad, loop back to 2.
  6. done

Example seen here: http://stackoverflow.com/questions/36906/what-is-the-fastest-way-to-swap-values-in-c/615995#615995


General tips:

http://www.ceet.niu.edu/faculty/kuo/exp/exp6/figuree6-1.jpg:

  • Try it in floating point first
  • Try it in fixed point second
  • If you are really desperate and have lots of time and money, try it in assembly

http://www.asc.edu/seminars/optimization/fig1.gif:

  • Check if it is memory OR I/O OR CPU bound
  • Attack whichever is the limiting factor


Trevor Boyd Smith
You're right for routines that are 1) tight loops not containing call instructions, 2) that sampling shows use high % of time, 3) that you actually compile. I'm sure that applies in DSP. In big systems I don't usually see it.
Mike Dunlavey
In times of SSE5, fixed point is no longer necessarily faster than floating point; just be sure to know your compiler parameters.
phresnel
+1  A: 

Most techniques are compiler-specific, as different compilers optimize differently.

If you want some optimization tips that are compiler agnostic, here are two for you:

  1. Don't do it.
  2. (for experts only!): Don't do it yet.

(apologies to Michael A. Jackson)

T.E.D.
+1  A: 

Sorry I don't have any references for you, but I do have another anecdote to add to the pile.

I had a rather large std::map that I was generating using Microsoft's CString object as the key. Performance was unacceptable. Since all of my strings were identical in length, I created a class wrapper around an old-fashioned fixed-size array of chars, to emulate the interface of CString. Unfortunately I can't remember the exact speedup, but it was significant, and the resulting performance was more than adequate.

Sometimes you need to know a little about the library constructs you rely upon.
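
A sketch of the kind of wrapper described (the details, including the key length, are assumed, since the original is recalled from memory): a fixed-size char array with one flat comparison, emulating just enough of the string interface to serve as a map key.

```cpp
#include <cstring>
#include <map>

// All keys in the anecdote had the same length; 16 is an assumed size.
struct FixedKey {
    char data[16];

    explicit FixedKey(const char* s) {
        std::strncpy(data, s, sizeof data);  // strncpy zero-pads the remainder
        data[sizeof data - 1] = '\0';
    }

    // One memcmp over inline storage, instead of a comparison that chases a
    // pointer to heap-allocated string data.
    bool operator<(const FixedKey& other) const {
        return std::memcmp(data, other.data, sizeof data) < 0;
    }
};
```

A std::map<FixedKey, int> then stores each key inline in its node and compares keys with a single memcmp, with no per-key heap allocation.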

Mark Ransom
+1  A: 

You might be interested in this: Optimizing C++ Wikibook

ShaChris23