views:

785

answers:

6

How to decrease the number of possible cache misses when designing a C++ program?

Does inlining functions help every time? Or is it good only when the program is CPU-bound (i.e. computation-oriented rather than I/O-oriented)?

+1  A: 

Allow the CPU to prefetch data efficiently. For example, you can decrease the number of cache misses by processing multi-dimensional arrays by rows rather than by columns, unrolling loops, etc.
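A minimal sketch of the rows-versus-columns point (the matrix size and function names are illustrative, not from the answer):

```cpp
#include <vector>

// Illustrative dimensions.
constexpr int ROWS = 1024;
constexpr int COLS = 1024;

long long sum_row_major(const std::vector<std::vector<int>>& m) {
    long long sum = 0;
    // Cache-friendly: the inner loop walks contiguous memory, so each
    // fetched cache line is fully used before it is evicted.
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sum += m[r][c];
    return sum;
}

long long sum_column_major(const std::vector<std::vector<int>>& m) {
    long long sum = 0;
    // Cache-hostile: consecutive accesses jump between rows, touching a
    // different cache line almost every iteration.
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            sum += m[r][c];
    return sum;
}
```

Both functions compute the same result; only the traversal order (and thus the miss rate) differs.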

This kind of optimization depends on the hardware architecture, so you are better off using a platform-specific profiler like Intel VTune to detect possible problems with the cache.

aku
+4  A: 

There is a very nice video by Herb Sutter that mentions this topic here

For data bound operations

  1. use arrays & vectors over lists, maps & sets

  2. process by rows over columns

Chris
A: 

Avoid using dynamic memory when it's not necessary. Using new, delete, smart pointers, and so on tends to spread your program data across memory. That's not good. If you can keep most of your data together (by declaring objects on the stack, for example), your cache will surely work much better.
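A sketch of the contrast Marc describes (the `Particle` type and function names are hypothetical):

```cpp
#include <memory>
#include <vector>

struct Particle { double x, y, z; };

// Cache-friendly: all Particles live contiguously inside one vector,
// so iteration touches memory sequentially.
double total_x_contiguous(const std::vector<Particle>& ps) {
    double sum = 0.0;
    for (const Particle& p : ps) sum += p.x;
    return sum;
}

// Cache-hostile: each Particle was allocated with new separately, so the
// objects may be scattered across the heap and each dereference can miss.
double total_x_scattered(const std::vector<std::unique_ptr<Particle>>& ps) {
    double sum = 0.0;
    for (const auto& p : ps) sum += p->x;
    return sum;
}
```

Both loops do the same arithmetic; the contiguous version simply gives the hardware prefetcher a predictable, dense access pattern.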

Marc
+1  A: 

Inlining functions can risk harming the instruction cache. And if the program is not memory-fetch bound, inlining is unlikely to make much (if any) difference.

As always, any optimization should be informed by profiling rather than hunches. Not to mention that you will need to understand what the profiler is telling you, which implies familiarity with assembly language and the particular characteristics of the platform you are optimising for.

A bit old now, but Mike Abrash's "Graphics Programming Black Book" still has lots of good general advice.

Tom Grove
+9  A: 

Here are some things that I like to consider when working on this kind of code.

  • Consider whether you want "structures of arrays" or "arrays of structures". Which you want to use will depend on each part of the data.
  • Try to keep structures to multiples of 32 bytes so they pack cache lines evenly.
  • Partition your data into hot and cold elements. If you have an array of objects of class o, and you use o.x, o.y, o.z together frequently but only occasionally need to access o.i, o.j, o.k, then consider putting o.x, o.y, and o.z together and moving the i, j, and k parts to a parallel auxiliary data structure.
  • If you have multi-dimensional arrays of data, then with the usual row-order layouts, access will be very fast when scanning along the preferred dimension and very slow along the others. Mapping the data along a space-filling curve instead will help to balance access speeds when traversing in any dimension. (Blocking techniques are similar -- they're just Z-order with a larger radix.)
  • If you must incur a cache miss, then try to do as much as possible with that data in order to amortize the cost.
  • Are you doing anything multi-threaded? Watch out for slowdowns from cache coherency protocols. Pad flags and small counters so that they'll be on separate cache lines.
  • SSE on Intel provides some prefetch intrinsics if you know what you'll be accessing far enough ahead of time.
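The hot/cold split from the third bullet can be sketched like this (the `Hot`/`Cold` types and field names are made up for illustration):

```cpp
#include <vector>

// "Array of structures": hot fields (x, y, z) share cache lines with
// cold fields (i, j, k), wasting bandwidth when only x/y/z are needed.
struct ObjectAoS {
    float x, y, z;  // accessed frequently (hot)
    int   i, j, k;  // accessed rarely (cold)
};

// Hot/cold split: the frequently used fields are packed together, while
// the rarely used ones move to a parallel auxiliary array.
struct Hot  { float x, y, z; };
struct Cold { int   i, j, k; };

float sum_x(const std::vector<Hot>& hot) {
    float sum = 0.0f;
    // Twice as many Hot entries fit per cache line as ObjectAoS entries.
    for (const Hot& h : hot) sum += h.x;
    return sum;
}
```

The cold `i`, `j`, `k` data would live in a separate `std::vector<Cold>` indexed in parallel, touched only on the rare paths that need it.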
Boojum
+2  A: 

Also, if you're doing C++ and multithreading, you need to consider false sharing, locality, and the hotness of the data in each processor's cache. That can make a big difference. And, especially in multithreading, computing things in a LIFO manner is more efficient than computing in a FIFO manner, though this also holds on single-processor architectures.
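The false-sharing point can be sketched with padding via `alignas` (assuming a 64-byte cache line, which is typical on x86; the struct names are illustrative):

```cpp
#include <atomic>

// Without padding, two counters updated by different threads can land on
// the same 64-byte cache line, so each write invalidates the other
// thread's cached copy ("false sharing").
struct CountersShared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// alignas(64) forces each counter onto its own cache line, assuming a
// 64-byte line size, so the threads no longer ping-pong the line.
struct CountersPadded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

static_assert(sizeof(CountersPadded) >= 128,
              "each counter occupies its own cache line");
```

C++17 also offers `std::hardware_destructive_interference_size` in `<new>` as a portable stand-in for the hard-coded 64.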

Robert Gould