There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.

Let's assume:

  • the code already is working correctly
  • the algorithms chosen are already optimal for the circumstances of the problem
  • the code has been measured, and the offending routines have been isolated
  • all attempts to optimize will also be measured to ensure they do not make matters worse

What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.

Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.

I'll add a reply with my own initial suggestions, and look forward to whatever else the SO community can think of.

+25  A: 

Throw more hardware at it!

Simon Svensson
My thoughts exactly. When you start talking about the "last few percent" and there is nothing left, clearly the cheapest way of making it run faster is faster hardware, rather than spending a ton of programmer time on squeezing those last percent out of it.
Stefan Thyberg
+1 an answer that especially managers seem to overlook too often ;)
jerryjvl
Although I want to encourage more software-oriented answers as well, because sometimes you may already have developers with spare time on their hands and no money to invest in hardware... or sometimes you may already have the fastest available hardware and still not enough performance. Especially considering that 'faster hardware' is looking less and less likely to be a never-ending path, and not everything parallelises.
jerryjvl
more hardware isn't always an option when you have software that is expected to run on hardware already out in the field.
Doug T.
Not a very helpful answer to someone making consumer software: the customer isn't going to want to hear you say, "buy a faster computer." Especially if you're writing software to target something like a video game console.
Crashworks
If you believe Moore's Law runs at x2/18 months, that's a compounded rate of ~0.13%/day (and it works weekends and doesn't take holidays :^). Busting a gut on last-few-percent optimization has to be weighed against what you can get for free by just waiting. If you take a long enough view, this applies just as much in the console/consumer/embedded space (the performance graphs are just more quantized), and you can still make a case for moving on to the next thing rather than investing time on something with small and diminishing returns.
timday
@Crashworks, or for that matter, an embedded system. When the last feature is finally in and the first batch of boards are already spun is not the moment to discover that you should have used a faster CPU in the first place...
RBerteig
Technically, that's the performance strategy of first resort.
MSN
I once had to debug a program that had a huge memory leak -- its VM size grew by about 1Mb per hour. A colleague joked that all I needed to do was add memory *at a constant rate*. :)
j_random_hacker
+3  A: 

Very difficult to give a generic answer to this question. It really depends on your problem domain and technical implementation. A general technique that is fairly language neutral: Identify code hotspots that cannot be eliminated, and hand-optimize assembler code.

dschwarz
+41  A: 

Suggestions:

  • Pre-compute rather than re-calculate: for any loops or repeated calls that contain calculations with a relatively limited range of inputs, consider making a lookup (array or dictionary) that contains the result of that calculation for all values in the valid range of inputs. Then use a simple lookup inside the algorithm instead.
    Down-sides: if few of the pre-computed values are actually used this may make matters worse, also the lookup may take significant memory.
  • Don't use library methods: most libraries need to be written to operate correctly under a broad range of scenarios, and perform null checks on parameters, etc. By re-implementing a method you may be able to strip out a lot of logic that does not apply in the exact circumstance you are using it.
    Down-sides: writing additional code means more surface area for bugs.
  • Do use library methods: to contradict myself, language libraries get written by people that are a lot smarter than you or me; odds are they did it better and faster. Do not implement it yourself unless you can actually make it faster (i.e.: always measure!)
  • Cheat: in some cases although an exact calculation may exist for your problem, you may not need 'exact', sometimes an approximation may be 'good enough' and a lot faster in the deal. Ask yourself, does it really matter if the answer is out by 1%? 5%? even 10%?
    Down-sides: Well... the answer won't be exact.
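As a concrete sketch of the pre-compute point above (the sine table is just an assumed example; any pure function over a small input range works the same way):

```cpp
#include <array>
#include <cmath>

// Hypothetical example: an algorithm that repeatedly needs sin() of an
// integer angle in degrees. Build the table once; each later use is a
// plain array index instead of a libm call. Non-negative angles assumed.
struct SinTable {
    std::array<double, 360> value{};
    SinTable() {
        for (int deg = 0; deg < 360; ++deg)
            value[deg] = std::sin(deg * 3.14159265358979323846 / 180.0);
    }
};

inline double fast_sin_deg(int deg) {
    static const SinTable table;   // built on first use
    return table.value[deg % 360];
}
```

The trade-off is exactly the one noted above: 360 doubles of memory buy back one libm call per use, which only pays off if the table entries actually get hit.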
jerryjvl
Precomputation doesn't always help, and it can even hurt sometimes -- if your lookup table is too big, it can kill your cache performance.
Adam Rosenfield
Good point... forgot to write down the down-sides...
jerryjvl
Cheating can often be the win. I had a color correction process that at the core was a 3-vector dotted with a 3x3 matrix. The CPU had a matrix multiply in hardware that left out some of the cross terms and went real fast compared to all the other ways to do it, but only supported 4x4 matrices and 4-vectors of floats. Changing the code to carry around the extra empty slot and converting the calculation to floating point from fixed point allowed for a slightly less-accurate but *much* faster result.
RBerteig
+ Yes, these are things that can be done when you have found a problem. The part I would emphasize is the "finding" skill. Most folks would say "All you gotta do is profile / measure", but to me, that is not enough. What's more, the repetition is key.
Mike Dunlavey
+6  A: 
  • What hardware are you running on? Can you use platform-specific optimizations (like vectorization)?
  • Can you get a better compiler? E.g. switch from GCC to Intel?
  • Can you make your algorithm run in parallel?
  • Can you reduce cache misses by reorganizing data?
  • Can you disable asserts?
  • Micro-optimize for your compiler and platform. In the style of, "at an if/else, put the most common statement first"
kotlinski
Should be "switch from GCC to LLVM" :)
Zifre
+4  A: 
  • Inline routines (eliminate call/return and parameter pushing)
  • Try eliminating tests/switches with table look ups (if they're faster)
  • Unroll loops (Duff's device) to the point where they just fit in the CPU cache
  • Localize memory access so as not to blow your cache
  • Localize related calculations if the optimizer isn't already doing that
  • Eliminate loop invariants if the optimizer isn't already doing that
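A minimal sketch of the tests-to-table idea above (the opcodes and cost values are made up for illustration):

```cpp
#include <cstddef>

// Replacing a switch over a small dense range with an indexed load.
enum Opcode { OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_DIV = 3 };

// Branchy version: the compiler may emit a jump table, or it may not.
int cycle_cost_switch(Opcode op) {
    switch (op) {
        case OP_ADD: return 1;
        case OP_SUB: return 1;
        case OP_MUL: return 3;
        case OP_DIV: return 20;
    }
    return 0;
}

// Table version: a guaranteed single load, no branches.
int cycle_cost_table(Opcode op) {
    static const int cost[4] = { 1, 1, 3, 20 };
    return cost[static_cast<std::size_t>(op)];
}
```

As the bullet says, "if they're faster": whether the table beats the switch still needs measuring, since a good compiler may already have converted the switch into a jump table.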
plinth
IIRC Duff's device is very rarely faster. Only when the op is very short (like a single small math expression)
BCS
+1  A: 

Any opportunity for parallelization?

n8wrl
+2  A: 

Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that how to optimize it.

Better cache locality, loop unrolling. Try to eliminate long dependency chains to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.
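The conditional-move suggestion can be sketched like this (a minimal illustration; whether the compiler actually emits cmov depends on the target and optimization flags):

```cpp
// Clamping a value: the ternary form typically compiles to
// conditional-move instructions rather than branches, so a stream of
// unpredictable inputs does not pay repeated misprediction penalties.
int clamp_branchy(int x, int lo, int hi) {
    if (x < lo) return lo;      // data-dependent branches
    if (x > hi) return hi;
    return x;
}

int clamp_branchless(int x, int lo, int hi) {
    int t = (x < lo) ? lo : x;  // usually a cmov
    return (t > hi) ? hi : t;   // and another
}
```

Both functions compute the same result; the difference only shows up in the emitted instructions, so check the assembly listing before trusting either form.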

Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve performance of your code. That's really the only truly general piece of advice I can think of.

Well, that, and "Show the code on SO and ask for optimization advice for that specific piece of code".

jalf
+3  A: 

If better hardware is an option then definitely go for that. Otherwise

  • Check you are using the best compiler and linker options.
  • If the hotspot routine is in a different library to its frequent caller, consider moving or cloning it to the caller's module. This eliminates some of the call overhead and may improve cache hits (cf. how AIX links strcpy() statically into separately linked shared objects). It could of course decrease cache hits as well, which is why you measure.
  • See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
  • Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
  • Consider: are you really using the best algorithm? Is it the best algorithm for your input size?
mealnor
+65  A: 

OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobbs article in November '93, by starting from a conventionally well-designed non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds, and the source code size was reduced by a factor of 4. My diagnostic tool was this. The sequence of changes was this:

  • The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.

  • Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 sec.

  • Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.

Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.

Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.

  • That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.

Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.

  • More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.

  • Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.

  • Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.

  • Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.

Total speedup factor: 43.6

Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on SO.

P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions. I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.

ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:

 /* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist)){
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task);

These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.

Here is the second problem, in two separate lines:

 /* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask);
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn);

These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.

I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.

REFERENCE ADDED: The source code, both original and redesigned, can be found at www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.

Mike Dunlavey
I'd love to read some of the details of the steps you outline above. Is it possible to include some fragments of the optimizations for flavour? (without making the post too long?)
jerryjvl
@jerryjvl: I'll see if I can post the sequence of sources (and maybe page images) somewhere. Bear with me.
Mike Dunlavey
... I'm not sure if I've got copyright issues with Dr. Dobbs.
Mike Dunlavey
... I also wrote a book that's now out of print, so it's going for a ridiculous price on Amazon - "Building Better Applications" ISBN 0442017405. Essentially the same material is in the first chapter.
Mike Dunlavey
Thanks for the additions,... I really wanted to accept your answer as a very good example of there being more room for improvement than you'd expect even when everything is already 'just right'... but I thought a few concrete points would improve the answer.
jerryjvl
And thanks for adding the source code reference :)
jerryjvl
+1: excellent post.
Peter Mortensen
+11  A: 

Since many of the performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.

Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record-by-record processing. This includes not reusing a single-record stored procedure when you want to insert 1,000,000 records at once.

Never use select *; only return the fields you actually need. This is especially true if there are any joins, as the join fields will be repeated and thus cause unnecessary load on both the server and the network.

Avoid the use of correlated subqueries. Use joins (including joins to derived tables) where possible. (I know this is true for Microsoft SQL Server, but test the advice when using a different backend.)

Index, index, index. And get those stats updated if applicable to your database.

Make the query sargable, meaning avoid things which make it impossible to use the indexes, such as using a wildcard in the first character of a like clause, or a function in the join or on the left side of a where clause.

Use correct data types. It is faster to do date math on a date field than to have to try to convert a string datatype to a date datatype, then do the calculation.

Never put a loop of any kind into a trigger!

Most databases have a way to check how the query execution will be done. In Microsoft SQL Server this is called an execution plan. Check those first to see where problem areas lie.

Consider how often the query runs as well as how long it takes to run when determining what needs to be optimized. Sometimes you can gain more performance from a slight tweak to a query that runs millions of times a day than you can from shaving time off a long-running query that only runs once a month.

Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time in the past when we couldn't figure out why the page was so slow to load when the stored procedure was fast, and found out through profiling that the webpage was asking for the query many, many times instead of once.

HLGEM
+4  A: 

You should probably consider the "Google perspective", i.e. determine how your application can become largely parallelized and concurrent, which will inevitably at some point also mean looking into distributing your application across different machines and networks, so that it can ideally scale almost linearly with the hardware you throw at it.

On the other hand, the Google folks are also known for throwing lots of manpower and resources at solving some of the issues in projects, tools and infrastructure they are using, such as for example whole program optimization for gcc by having a dedicated team of engineers hacking gcc internals in order to prepare it for Google-typical use case scenarios.

Similarly, profiling an application no longer means to simply profile the program code, but also all its surrounding systems and infrastructure (think networks, switches, server, RAID arrays) in order to identify redundancies and optimization potential from a system's point of view.

none
+3  A: 

I think this has already been said in a different way. But when you're dealing with a processor-intensive algorithm, you should simplify everything inside the innermost loop at the expense of everything else.

That may seem obvious to some, but it's something I try to focus on regardless of the language I'm working with. If you're dealing with nested loops, for example, and you find an opportunity to take some code down a level, you can in some cases drastically speed up your code. As another example, there are little things to think about like working with integers instead of floating-point variables whenever you can, and using multiplication instead of division whenever you can. Again, these are things that should be considered for your innermost loop.

Sometimes you may find benefit in performing your math operations on an integer inside the inner loop, and then scaling it down to a floating-point variable to work with afterwards. That's an example of sacrificing speed in one section to improve the speed in another, but in some cases the payoff can be well worth it.

Steve Wortham
+28  A: 

When you can't improve the performance any more - see if you can improve the perceived performance instead.

You may not be able to make your fooCalc algorithm faster, but often there are ways to make your application seem more responsive to the user.

A few examples:

  • anticipating what the user is going to request, and starting work on that before they ask
  • displaying results as they come in, instead of all at once at the end
  • an accurate progress meter

These won't make your program faster, but they might make your users happier with the speed you have.

kenj0418
+28  A: 

I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:

  • Cache misses. Data cache is the #1 source of stalls in most programs. Improve cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls.
  • Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
  • Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
  • Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
  • Sequential floating-point ops. Make these SIMD.
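To make the __restrict point above concrete, a minimal sketch (the spelling is compiler-specific: `__restrict` on MSVC, GCC and Clang in C++ mode, `restrict` in C99):

```cpp
// Promising the compiler that src and dst never alias lets it keep
// values in registers and vectorize the loop, instead of conservatively
// reloading src after every store to dst.
void scale(float* __restrict dst, const float* __restrict src,
           float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;   // no reload of src between iterations
}
```

The promise is unchecked: if the two pointers do overlap at runtime, the behavior is undefined, so only apply it where non-aliasing is genuinely guaranteed.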

And one more thing I like to do:

  • Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.
Crashworks
What profiler(s) do you use? And are any of them applicable to C# .NET development? I'd love to try some of these myself.
jerryjvl
I mostly use Intel VTune and PIX. No idea if they can adapt to C#, but really once you've got that JIT abstraction layer most of these optimizations are beyond your reach, except for improving cache locality and maybe avoiding some branches.
Crashworks
Even so, checking on the post-JIT output may help figure out if there are any constructs that just do not optimize well through the JIT stage... investigation can never hurt, even if turns out a dead end.
jerryjvl
+1  A: 

Sometimes changing the layout of your data can help. In C, you might switch from an array of structures to a structure of arrays, or vice versa.
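A minimal before/after of that switch (the Point struct is hypothetical):

```cpp
#include <vector>

// Array-of-structures: summing just x also drags y and z through the
// cache, wasting two thirds of every line fetched.
struct PointAoS { float x, y, z; };

float sum_x_aos(const std::vector<PointAoS>& pts) {
    float s = 0;
    for (const auto& p : pts) s += p.x;
    return s;
}

// Structure-of-arrays: the x values are contiguous, so every byte
// fetched into cache is used, and the loop vectorizes cleanly.
struct PointsSoA { std::vector<float> x, y, z; };

float sum_x_soa(const PointsSoA& pts) {
    float s = 0;
    for (float v : pts.x) s += v;
    return s;
}
```

Which layout wins depends on the access pattern: SoA favors loops that touch one field of many objects, AoS favors code that touches all fields of one object at a time.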

Nosredna
A: 

There is no such blanket statement possible, it depends on the problem domain. Some possibilities:

Since you don't specify outright that your application is 100% calculating:

  • Search for calls that block (database, network, hard disk, display update), and isolate them and/or put them in a thread.

If you use a database and it happens to be Microsoft SQL Server:

  • investigate nolock and rowlock directives. (There are threads on this forum.)

If your app is purely calculating, you can look at this question of mine about cache optimization for rotating large images. The increase in speed flabbergasted me.

It is a long shot, but maybe it gives an idea, especially if your problem is in the imaging domain: rotating-bitmaps-in-code

Another one is avoiding dynamic memory allocation as much as possible. Allocate multiple structs at once, release them at once.
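A sketch of the allocate-at-once idea, as a trivial bump arena (hypothetical; there is no per-object free, everything is released together):

```cpp
#include <cstddef>
#include <vector>

// One large allocation up front replaces many small malloc/free pairs.
// Objects are handed out by bumping an offset, and are all released
// together when the arena is reset or destroyed.
class Arena {
    std::vector<unsigned char> buf_;
    std::size_t used_ = 0;
public:
    explicit Arena(std::size_t bytes) : buf_(bytes) {}
    void* alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t{7};        // round up to 8-byte alignment
        if (used_ + n > buf_.size()) return nullptr;
        void* p = buf_.data() + used_;
        used_ += n;
        return p;
    }
    void reset() { used_ = 0; }               // release everything at once
};
```

This fits workloads with a clear phase structure (allocate a batch, process, throw it all away); it is no help when object lifetimes are interleaved.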

Otherwise, identify your tightest loops and post them here, either in pseudo-code or not, with some of the data structures.

Marco van de Voort
+5  A: 
  • When you get to the point that you're using efficient algorithms, it's a question of what you need more: speed or memory. Use caching to "pay" in memory for more speed, or use calculations to reduce the memory footprint.
  • If possible (and more cost-effective), throw hardware at the problem - a faster CPU, more memory or HD could solve the problem faster than trying to code it.
  • Use parallelization if possible - run parts of the code on multiple threads.
  • Use the right tool for the job. Some programming languages create more efficient code; using managed code (e.g. Java/.NET) speeds up development, but native programming languages create faster-running code.
  • Micro-optimize. Only where applicable, you can use optimized assembly to speed up small pieces of code; using SSE/vector optimizations in the right places can greatly increase performance.
Dror Helper
+9  A: 

More suggestions:

  • Avoid I/O: Any I/O (disk, network, ports, etc.) is always going to be far slower than any code that is performing calculations, so get rid of any I/O that you do not strictly need.

  • Move I/O up-front: Load up all the data you are going to need for a calculation up-front, so that you do not have repeated I/O waits within the core of a critical algorithm (and maybe as a result repeated disk seeks, when loading all the data in one hit may avoid seeking).

  • Delay I/O: Do not write out your results until the calculation is over, store them in a data structure and then dump that out in one go at the end when the hard work is done.

  • Threaded I/O: For those daring enough, combine 'I/O up-front' or 'Delay I/O' with the actual calculation by moving the loading into a parallel thread, so that while you are loading more data you can work on a calculation on the data you already have, or while you calculate the next batch of data you can simultaneously write out the results from the last batch.
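The 'Delay I/O' item above can be sketched like this (names and file path hypothetical):

```cpp
#include <fstream>
#include <string>
#include <vector>

// Instead of writing each result as it is produced (one I/O wait per
// result, interleaved with the hot loop), accumulate results in memory
// and flush them in one go after the computation is finished.
std::vector<double> results;

void record(double r) {
    results.push_back(r);          // cheap: no I/O inside the hot loop
}

void flush_results(const std::string& path) {
    std::ofstream out(path);       // single open/write/close at the end
    for (double r : results)
        out << r << '\n';
}
```

The down-side, as with pre-computation, is memory: the whole result set must fit in RAM until the final flush.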

jerryjvl
+1  A: 

In a language with templates (C++/D) you can try propagating constant values via template args. You can even do this for small sets of not really constant values with a switch.

Foo(i, j); // i always in 0-4.

becomes

switch(i)
{
    case 0: Foo<0>(j); break;
    case 1: Foo<1>(j); break;
    case 2: Foo<2>(j); break;
    case 3: Foo<3>(j); break;
    case 4: Foo<4>(j); break;
}

The downside is cache pressure so this would only be a gain in deep or long running call trees where the value is constant for the duration.

BCS
+7  A: 

The single most important limiting factor today is limited memory bandwidth. Multi-cores are just making this worse, as the bandwidth is shared between cores. Also, the limited chip area devoted to implementing caches is divided among the cores and threads, worsening this problem even more. Finally, the inter-chip signalling needed to keep the different caches coherent also increases with the number of cores. This adds a further penalty.

These are the effects that you need to manage: sometimes through micro-managing your code, but sometimes through careful consideration and refactoring.

A lot of comments already mention cache friendly code. There are at least two distinct flavors of this:

  • Avoid memory fetch latencies.
  • Lower memory bus pressure (bandwidth).

The first problem specifically has to do with making your data access patterns more regular, allowing the hardware prefetcher to work efficiently. Avoid dynamic memory allocation which spreads your data objects around in memory. Use linear containers instead of linked lists, hashes and trees.

The second problem has to do with improving data reuse. Alter your algorithms to work on subsets of your data that do fit in available cache, and reuse that data as much as possible while it is still in the cache.

Packing data tighter, and making sure you use all the data in the cache lines in the hot loops, will help avoid these other effects and allow fitting more useful data in the cache.
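The data-reuse point is the classic loop-tiling transformation; here is a sketch for a matrix transpose (the block size is an assumption to be tuned per machine):

```cpp
#include <vector>

// Loop tiling for data reuse: instead of striding through the whole
// matrix (evicting each cache line before it is reused), process it in
// BLOCK x BLOCK tiles small enough to stay resident in cache.
constexpr int BLOCK = 32;   // tune per cache size

void transpose_tiled(const std::vector<float>& in, std::vector<float>& out,
                     int n) {                        // n x n matrix
    for (int bi = 0; bi < n; bi += BLOCK)
        for (int bj = 0; bj < n; bj += BLOCK)
            for (int i = bi; i < bi + BLOCK && i < n; ++i)
                for (int j = bj; j < bj + BLOCK && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```

The same blocking idea applies to matrix multiply, image filters, and anything else that sweeps a dataset larger than the cache more than once.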

Mats N
+2  A: 

The last few % is a very CPU- and application-dependent thing...

  • Cache architectures differ; some chips have on-chip RAM you can map directly, ARMs (sometimes) have a vector unit, the SH4 has a useful matrix opcode. Is there a GPU - maybe a shader is the way to go. TMS320s are very sensitive to branches within loops (so separate your loops and move conditions outside them if possible).

The list goes on.... But these sorts of things really are the last resort...

Build for x86, and run Valgrind/Cachegrind against the code for proper performance profiling. Or use Texas Instruments' CCStudio, which has a sweet profiler. Then you'll really know where to focus...

Cwaig
+2  A: 

Caching! A cheap way (in programmer effort) to make almost anything faster is to add a caching abstraction layer to any data-movement area of your program, be it I/O or just the passing/creation of objects or structures. Often it's easy to add caches to factory classes and reader/writers.

Sometimes the cache will not gain you much, but it's an easy approach to just add caching all over and then disable it where it doesn't help. I've often found this gains huge performance without having to micro-analyse the code.
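A minimal sketch of such a caching layer (a memoizing wrapper for a pure function; the names are illustrative):

```cpp
#include <functional>
#include <map>
#include <utility>

// A thin memoizing wrapper: results of an expensive pure function are
// stored by argument, so repeat calls become a map lookup. Easy to put
// in front of factories or readers, and easy to rip out again if
// measurement shows it isn't paying for its memory.
template <class K, class V>
class Cached {
    std::function<V(K)> fn_;
    std::map<K, V> hits_;
public:
    explicit Cached(std::function<V(K)> fn) : fn_(std::move(fn)) {}
    V operator()(const K& key) {
        auto it = hits_.find(key);
        if (it != hits_.end()) return it->second;   // cache hit
        V v = fn_(key);                              // miss: compute once
        hits_.emplace(key, v);
        return v;
    }
};
```

The usual caveats apply: the wrapped function must be pure (same input, same output, no side effects), and the map grows without bound unless an eviction policy is added.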

Killroy
+1  A: 

Tweak the OS and framework.

It may sound like overkill, but think about it like this: operating systems and frameworks are designed to do many things. Your application only does very specific things. If you could get the OS to do exactly what your application needs, and have your application understand how the framework (PHP, .NET, Java) works, you could get much more out of your hardware.

Facebook, for example, changed some kernel-level things in Linux and changed how memcached works (for example, they wrote a memcached proxy, and used UDP instead of TCP).

Another example of this is Windows Server 2008. Win2K8 has a version where you can install just the basic OS needed to run X applications (e.g. web apps, server apps). This reduces much of the overhead that the OS has on running processes and gives you better performance.

Of course, you should always throw in more hardware as the first step...

Nir Levy
+1  A: 

The Google way is one option: "Cache it. Whenever possible, don't touch the disk."

Vadi
+1  A: 

Divide and conquer

If the dataset being processed is too large, loop over chunks of it. If you've done your code right, implementation should be easy. If you have a monolithic program, now you know better.

MPelletier