views:

1891

answers:

12

This could sound a subjective question, but what i am looking for is specific instances which you would have encountered related to this.

  1. How to make a code, cache effective-cache friendly? (More cache hits, as less cahce misses as possible). from both perspectives, data cache & program cache(instruction cache). i.e. What all things in one's code, related to data structures, code constructs one should take care of to make it cache effective.

  2. Are there any particular data structures one must use, must avoid,or particular way of accessing the memers of that structure etc.. to make code cache effective.

  3. Are there any program constructs(if, for, switch, break, goto,...), code-flow(for inside a if, if inside a for, etc...) one should follow/avoid in this matter?

I am looking forward to hear individual experiences related to making a cache efficient code in general. It can be any programming language(C,C++,ASsembly,...), any hardware target(ARM,Intel,PowerPC,...), any OS(Windows,Linux,Symbian,...) etc..

More the variety, it will help better to understand it deeply.

+1  A: 

I can answer (2) by saying that in the C++ world, linked lists can easily kill the CPU cache. Arrays are a better solution where possible. No experience on whether the same applies to other languages, but it's easy to imagine the same issues would arise.

Andrew
@Andrew: How about structures. Are they cache efficient? Do they have any size constraints to be cache efficient?
goldenmean
A struct is a single block of memory, so as long as it doesn't exceed the size of your cache you won't see an impact. It's only when you have a collection of structs (or classes) that you'll see cache hits and it depends on the way you organise the collection. An array butts the objects up against each other (good) but a linked list can have objects all over your address space with links between them, which is obviously bad for cache performance.
Andrew
Some way to use linked lists without killing the cache, most effective for not big lists, is to create your own memory pool, that is - to allocate one large array. then instead of 'malloc'ing (or 'new'ing in C++) memory for each little linked list member, which may be allocated in an entirely different place in memory, and waste managing space, you give it memory from your memory pool, highly increasing the odds that logically close members of the list, will be on the cache together.
Liran Orevi
Sure, but it's a lot of work getting std::list<> et al. to use your custom memory blocks. When I was a young whippersnapper I'd absolutely go that path, but these days... too many other things to tackle.
Andrew
+1  A: 

Cache is arranged in "cache lines" and (real) memory is read from and written to in chunks of this size.

Data structures that are contained within a single cache-line are therefore more efficient.

Similarly, algorithms which access contiguous memory blocks will be more efficient than algorithms which jump through memory in a random order.

Unfortunately the cache line size varies dramatically between processors, so there's no way to guarantee that a data structure that's optimal on one processor will be efficient on any other.

Alnitak
+8  A: 

I can't believe there aren't more answers to this. Anyway, one classic example is to iterate a multidimensional array "inside out":

pseudocode
for (i = 0 to size)
  for (j = 0 to size)
    do something with ary[j][i]

The reason this is cache inefficient is because modern CPUs will load the cache line with "near" memory addresses from main memory when you access a single memory address. We are iterating through the "j" (outer) rows in the array in the inner loop, so for each trip through the inner loop, the cache line will cause to be flushed and loaded with a line of addresses that are near to the [j][i] entry. If this is changed to the equivalent:

for (i = 0 to size)
  for (j = 0 to size)
    do something with ary[i][j]

It will run much faster.

1800 INFORMATION
Thanks for a good relevant pointer applicable to my area of work, Video Coding.
goldenmean
Classic example.
Liran Orevi
back in college we had an assignment on matrix multiplication. It turned out that it was faster to take a transpose of the "columns" matrix first and multiply rows by rows rather than rows by cols for that precise reason.
ykaganovich
actually, most of the modern compilers can figure this out by itselves (with optimizations turned on)
Ricardo Nolde
+3  A: 

The most effective data structure for a cache is an array. Caches work best, if your data structure is laid out sequentially as CPUs read entire cache lines (usually 32 bytes or more) at once from main memory.

Any algorithm which accesses memory in random order trashes the caches because it always needs new cache lines to accomodate the randomly accessed memory. On the other hand an algorithm, which runs sequentially through an array is best because:

  1. It gives the CPU a chance to read-ahead, e.g. speculatively put more memory into the cache, which will be accessed later. This read-ahead gives a huge performance boost.

  2. Running a tight loop over a large array also allows the CPU to cache the code executing in the loop and in most cases allows you to execute an algorithm entirely from cache memory without having to block for external memory access.

__grover
@Grover: About your point 2. so can one say that if insidea tight loop, a function is being called for each loop count, then it wil fetch new code altogether and cause a cache miss, instead if u can put the function as a code in the for loop itself,no function call, it would be faster due to fewer cache misses?
goldenmean
Yes and no. The new function will be loaded in the cache. If there's enough cache space, then on second iteration it'll already have that function in the cache so there's no reason to reload it again. So it is a hit on the first call. In C/C++ you can ask the compiler to place functions right next to each other using appropriate segments.
__grover
One more note: If you call out of the loop and there's not enough cache space the new function will be loaded into cache regardless. It may even happen that the original loop will be thrown out of the cache. In this case the call will incur up to three penalties for each iteration: One to load the call target and another to reload the loop. And a third if the loop head is not in the same cache line as the call return address. In that case jumping to the loop head also needs a new memory access.
__grover
+5  A: 

Apart from data access patterns, a major factor in cache-friendly code is data size. Less data means more of it fits into the cache.

This is mainly a factor with memory-aligned data structures. "Conventional" wisdom says data structures must be aligned at word boundaries because the CPU can only access entire words, and if a word contains more than one value, you have to do extra work (read-modify-write instead of a simple write). But caches can completely invalidate this argument.

Similarly, a Java boolean array uses an entire byte for each value in order to allow operating on individual values directly. You can reduce the data size by a factor of 8 if you use actual bits, but then access to individual values becomes much more complex, requiring bit shift and mask operations (the BitSet class does this for you). However, due to cache effects, this can still be considerably faster than using a boolean[] when the array is large. IIRC I once achieved a speedup by a factor of 2 or 3 this way.

Michael Borgwardt
Last paragraph was quite interesting.
Skurmedel
By "word" you mean 1 byte per boolean, right?
Seun Osewa
@Seun: you're right, I got that wrong. Fixed now.
Michael Borgwardt
A: 

Write your program to take a minimal size. That is why it is not always a good idea to use -O3 optimisations for GCC. It takes up a larger size. Often, -Os is just as good as -O2. It all depends on the processor used though. YMMV.

Work with small chunks of data at a time. That is why a less efficient sorting algorithms can run faster than quicksort if the data set is large. Find ways to break up your larger data sets into smaller ones. Others have suggested this.

In order to help you better exploit instruction temporal/spatial locality, you may want to study how your code gets converted in to assembly. For example:

for(i = 0; i < MAX; ++i)
for(i = MAX; i > 0; --i)

The two loops produce different codes even though they are merely parsing through an array. In any case, your question is very architecture specific. So, your only way to tightly control cache use is by understanding how the hardware works and optimising your code for it.

sybreon
Interesting point. Do look-ahead caches make assumptions based on the direction of a loop/pass through memory?
Andrew
There are many ways to design speculative data caches. Stride based ones do measure the 'distance' and 'direction' of data accesses. Content based ones chase pointer chains. There are other ways to design them.
sybreon
+6  A: 

I recommend reading the 9-part article What every programmer should know about memory by Ulrich Drepper if you're interested in how memory and software interact. It's also available as a 104-page PDF.

Sections especially relevant to this question might be Part 2 (CPU caches) and Part 5 (What programmers can do - cache optimization).

Tomi Kyöstilä
+3  A: 

The basic rules are actually fairly simple. Where it gets tricky is in how they apply to your code.

The cache works on two principles: Temporal locality and spatial locality. The former is the idea that if you recently used a certain chunk of data, you'll probably need it again soon. The latter means that if you recently used the data at address X, you'll probably soon need address X+1.

The cache tries to accomodate this by remembering the most recently used chunks of data. It operates with cache lines, typically sized 128 byte or so, so even if you only need a single byte, the entire cache line that contains it gets pulled into the cache. So if you need the following byte afterwards, it'll already be in the cache.

And this means that you'll always want your own code to exploit these two forms of locality as much as possible. Don't jump all over memory. Do as much work as you can on one small area, and then move on to the next, and do as much work there as you can.

A simple example is the 2D array traversal that 1800's answer showed. If you traverse it a row at a time, you're reading the memory sequentially. If you do it column-wise, you'll read one entry, then jump to a completely different location (the start of the next row), read one entry, and jump again. And when you finally get back to the first row, it will no longer be in the cache.

The same applies to code. Jumps or branches mean less efficient cache usage (because you're not reading the instructions sequentially, but jumping to a different address). Of course, small if-statements probably won't change anything (you're only skipping a few bytes, so you'll still end up inside the cached region), but function calls typically imply that you're jumping to a completely different address that may not be cached. Unless it was called recently.

Instruction cache usage is usually far less of an issue though. What you usually need to worry about is the data cache.

In a struct or class, all members are laid out contiguously, which is good. In an array, all entries are laid out contiguously as well. In linked lists, each node is allocated at a completely different location, which is bad. Pointers in general tend to point to unrelated addresses, which will probably result in a cache miss if you dereference it.

And if you want to exploit multiple cores, it can get really interesting, as usually, only one CPU may have any given address in its L1 cache at a time. So if both cores constantly access the same address, it will result in constant cache misses, as they're fighting over the address.

jalf
+1, good and practical advice. One addition: Time locality and space locality combined suggest, that for matrix ops for example, it might be advisable to split them into smaller matrices that completely fit into a cache line, or whose rows/columns fit into cache lines. I remember doing that for visualization of multidim. data. It provided some serious kick in the pants. It is good to remember that the cache does hold more than one 'line' ;)
AndreasT
A: 

A good (though trivial) example is using a hashtable (or similar data structure) to store the results of the calculations after they have been calculated initially. You can get some significant speed benefits on subsequent calculations.

Conrad
+1  A: 

To ask how to make a code, cache effective-cache friendly and most of the other questions , is usually to ask how to Optimize a program, that's because the cache has such a huge impact on performances that any optimized program is one that is cache effective-cache friendly.

I suggest reading about Optimization, there are some good answers on this site. In terms of books, I recommend on Computer Systems: A Programmer's Perspective which has some fine text about the proper usage of the cache.

(b.t.w - as bad as a cache-miss can be, there is worse - if a program is paging from the hard-drive...)

Liran Orevi
+2  A: 

Only one post touched on it, but a big issue comes up when sharing data between processes. You want to avoid having multiple processes attempting to modify the same cache line simultaneously. Something to look out for here is "false" sharing, where two adjacent data structures share a cache line and modifications to one invalidates the cache line for the other. This can cause cache lines to unnecessarily move back and forth between processor caches sharing the data on a multiprocessor system. A way to avoid it is to align and pad data structures to put them on different lines.

RussellH
+10  A: 

The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transfered (preserving memory bandwidth).

Techniques for avoiding suffering from memory fetch latency is typically the first thing to consider, and sometimes helps a long way. The limited memory bandwidth is also a limiting factor, particularly for multicores and multithreaded applications where many threads wants to use the memory bus. A different set of techniques help addressing the latter issue.

Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to a cache. When we have looked at various standard benchmarks, we have seen that a surprising large fraction of those fail to use 100% of the fetched cache lines before the cache lines are evicted.

Improving cache line utilization helps in three respects:

  • It tends to fit more useful data in the cache, essentially increasing the effective cache size.
  • It tends to fit more useful data in the same cache line, increasing the likelyhood that requested data can be found in the cache.
  • It reduces the memory bandwidth requirements, as there will be fewer fetches.

Common techniques are:

  • Use smaller data types
  • Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
  • Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
  • Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
  • avoid algorithms and datastructures that exhibit irregular access patterns, and favor linear datastructures.

We should also note that there are other ways to hide memory latency than using caches.

Modern CPU:s often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hw prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher is usually doing a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.

Regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustain one latency hit (Memory level parallelism).

To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.

Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches.

While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.

These things are what really pays off in the multicore world, where you typically wont see much of throughput improvements after adding the second thread.

Mats N