Say I have a big array and multiple threads reading from it. Each thread iterates through the array by jumping a constant amount, but starts at a different offset. So thread 1 may start at element 0 and then read elements 32, 64, 96, and so on, while thread 2 starts at element 1 and reads elements 33, 65, 97, etc. (keeping in mind that an 'element' may be more than a single byte or word). I know that spatial locality is usually desirable for getting the best cache performance, but I've also read that modern CPUs have hardware prefetchers that look for patterns in accesses, and a stride seems like an obvious pattern to me.
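
Concretely, the access pattern I have in mind looks roughly like this (a minimal C++ sketch; the element type, array size, stride of 32, and thread count are just illustrative choices):

    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        constexpr std::size_t kElements = 1 << 20;  // the "big array" (illustrative size)
        constexpr std::size_t kStride   = 32;       // elements jumped on every read
        constexpr std::size_t kThreads  = 32;       // one thread per starting offset

        std::vector<double> data(kElements, 1.0);
        std::vector<double> sums(kThreads, 0.0);

        std::vector<std::thread> threads;
        for (std::size_t offset = 0; offset < kThreads; ++offset) {
            threads.emplace_back([&, offset] {
                double local = 0.0;
                // Thread 'offset' reads elements offset, offset+32, offset+64, ...
                for (std::size_t i = offset; i < data.size(); i += kStride)
                    local += data[i];
                sums[offset] = local;  // each thread writes only its own slot
            });
        }
        for (auto& t : threads) t.join();

        double total = 0.0;
        for (double s : sums) total += s;
        std::printf("total = %f\n", total);  // keep the reads from being optimized away
        return 0;
    }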

  • So is this cache friendly on a modern box or isn't it?
  • What if I increase the stride to something larger than a cache line?
  • Is the answer affected by the use of multiple threads (so despite accessing the same memory they may be running on different cores with different caches)?
+4  A: 

Cache performance is pretty complex, and the really reliable answers are going to come from hardware designers or operating system developers who work specifically with scheduling and dispatching. I used to work on performance analysis tools for large IBM systems, so I can give a partial, slightly out-of-date answer:

First, cache memory is associative by address. When a piece of memory is addressed, the "cache line" containing that address is loaded into the cache. Depending on the processor design, a line could be 32, 64, or 128 bytes long. This is based on the alignment of hardware addresses; in other words, a 64-byte line sits on a boundary whose address is divisible by 64. Your memory reference may be at the beginning, middle, or end of that cache line.
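
As a rough illustration of what "the line containing an address" means (assuming a 64-byte line here; the real size depends on the processor):

    #include <cstdint>
    #include <cstdio>

    int main() {
        constexpr std::uintptr_t kLineSize = 64;  // assumed line size, for illustration only

        int x = 0;
        auto addr = reinterpret_cast<std::uintptr_t>(&x);

        // The line containing 'addr' starts at the next-lower multiple of the line size.
        std::uintptr_t line_base   = addr & ~(kLineSize - 1);
        std::uintptr_t line_offset = addr - line_base;  // where in the line the object sits

        std::printf("address %#llx is byte %llu of the line starting at %#llx\n",
                    static_cast<unsigned long long>(addr),
                    static_cast<unsigned long long>(line_offset),
                    static_cast<unsigned long long>(line_base));
        return 0;
    }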

Once it's in the cache, the address is used as a "lookup" to find the cached data.

Locality of reference will help you if the cache line is large enough that an "adjacent" item you reference happens to have been loaded as part of the same line. Jumping through your array defeats this.
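
To put numbers on that (all of them assumptions for illustration: 8-byte elements, a 64-byte line, a stride of 32 elements), each read lands 256 bytes past the last one, so every read pulls in a fresh line of which the thread uses a single element, where sequential access would use all eight:

    #include <cstdio>

    int main() {
        constexpr unsigned kElemSize = 8;   // say, a double (illustrative)
        constexpr unsigned kLineSize = 64;  // assumed cache-line size
        constexpr unsigned kStride   = 32;  // elements jumped per read

        constexpr unsigned bytes_per_jump = kStride * kElemSize;    // 256 bytes
        constexpr unsigned elems_per_line = kLineSize / kElemSize;  // 8 elements

        // Sequential access would touch all 8 elements of each fetched line;
        // this stride lands on a new line every read and uses only 1 element of it.
        std::printf("jump = %u bytes (vs. a %u-byte line); sequential uses %u elements "
                    "per line, this stride uses 1\n",
                    bytes_per_jump, kLineSize, elems_per_line);
        return 0;
    }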

Cache designs vary widely based on vendor, product line, processor price, and a whole lot more. Perfect cache optimization is going to be highly elusive unless (1) you know a whole lot about the machine you're going to run on, and (2) you're really not interested in running on any other machine.

One other factor to consider is that 64-bit addresses are twice the size of 32-bit addresses, which costs you in two ways: the cache needs more tag bits per line to identify what it holds, and any pointers stored in your data are themselves twice as large, so pointer-heavy structures take up more cache. Giving more bits to addresses means fewer bits left for data, more or less.
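
For instance, here's the pointer-size effect on a pointer-heavy structure (exact sizes depend on the ABI; these are typical values):

    #include <cstdio>

    // On a typical 32-bit ABI this node is 8 bytes (4-byte pointer + 4-byte int);
    // on a typical 64-bit ABI it is 16 bytes (8-byte pointer + 4-byte int + padding),
    // so half as many nodes fit in any given cache line.
    struct Node {
        Node* next;
        int   value;
    };

    int main() {
        std::printf("sizeof(Node) = %zu bytes, sizeof(Node*) = %zu bytes\n",
                    sizeof(Node), sizeof(Node*));
        return 0;
    }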

Prefetching is more witchcraft than science. Fetching data from memory into cache is expensive, even when it happens asynchronously with processor execution (although it can never be too far separated from execution). Locality of reference is a good rule, although it's based on hardware architecture in ways that don't necessarily match code execution on the micro scale. LRU (least recently used) is a common way of deciding what to evict from cache, but evicting something to make room for data that never ends up being used is not much of an optimization. So prefetching will be judicious, to say the least.
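
If you want to experiment rather than rely on the hardware prefetcher, GCC and Clang expose a software prefetch hint; here's a sketch (the lookahead of 8 strides is a guess that would need measuring and tuning):

    #include <cstddef>
    #include <vector>

    // Strided sum with an explicit prefetch hint issued a few iterations ahead.
    // __builtin_prefetch is a GCC/Clang builtin; the arguments after the address
    // say "read access" (0) and "low temporal locality" (1).
    double strided_sum(const std::vector<double>& data,
                       std::size_t offset, std::size_t stride) {
        double sum = 0.0;
        for (std::size_t i = offset; i < data.size(); i += stride) {
            std::size_t ahead = i + 8 * stride;   // assumed prefetch distance
            if (ahead < data.size())
                __builtin_prefetch(&data[ahead], /*rw=*/0, /*locality=*/1);
            sum += data[i];
        }
        return sum;
    }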

EDIT: virtual memory issues, task switching, etc.

Virtual memory certainly makes things far more interesting, especially in operating systems that support multiple address spaces. Cache is most likely to be based on real addresses, not virtual addresses, so things like page swaps can have interesting side effects on caching. Typically, a page that is due to be swapped out or released will first be invalidated and moved to a "flush list" (where it can be written to the swap file) or a "free list". Depending on the implementation, these pages can still be reclaimed by an application, but they're no longer addressable, meaning a page fault would occur in the process of reclaiming them. So once a page has been moved out of an app's working set, it's very likely that any cache lines associated with it will be invalidated. If the page isn't being heavily used, it's not likely to have much in cache either, but in a heavy swapping situation, cache performance can take a hit along with swapping performance.

Also, some cache designs have "shared" cache, and most or all have processor- and core-specific cache. Where a cache is dedicated to a specific processor or core and that core changes task, the entire cache is likely to be flushed to avoid corruption by the new process. This would not include thread switching, since threads run in the same process and the same address space. The real issue here is that high activity in other applications on the system can impact your cache performance. Shared cache alleviates this problem to some extent, but has to be more carefully managed to avoid corruption.

Cylon Cat
Wikipedia seems to think that most caches are based on virtual addresses, "most modern level-1 caches are virtually indexed", http://en.wikipedia.org/wiki/CPU_cache#Address_translation
Joseph Garvin
Whoever wrote on Wikipedia that L1 caches are mostly virtually indexed is not correct. Most L1 data caches are indeed physically addressed. Using virtual addresses as your cache index requires the processor (and/or operating system) to go through all sorts of hoops when you change your page table. It means that when page table entries are changed or removed, you need some strategy to clean the invalidated entries out of your virtually indexed caches.
boiler96
If they're not virtually indexed, doesn't that throw away most of the programmer's ability to reason about cache effects? If I try to use a 'contiguous' block of memory in my app (say, iterating an array instead of a linked list) for better prefetching, the corresponding physical addresses may not be contiguous at all.
Joseph Garvin
@Joseph, I wouldn't worry about this. Physical memory is generally allocated in pages. As far as I know that is still 4 KB (if it changes, it will only grow). Cache lines are much, much smaller, typically 32 or 64 bytes. The best strategy is always to maintain locality of reference, and these numbers show you why. A small array will often be contained in a single page. A large array will cover a number of pages, but that number is much smaller than the number of array elements, and locality of reference still dominates. Mostly, just trust the cache designer to "do the right thing."
Cylon Cat