Cache performance is pretty complex, and the really reliable answers are going to come from hardware designers or operating system developers who work specifically with scheduling and dispatching. I used to work on performance-analysis tools for large IBM systems, so I can give a partial, slightly out-of-date answer:
First, cache memory is associative by address. If a piece of memory is addressed, the "cache line" for that address is loaded into cache. Depending on processor design, this could be 4, 8, 16, or 32 bytes in length. (Maybe more.) This will most likely be based on the "alignment" of hardware addresses; in other words, a 32-byte line will start on a boundary whose address is divisible by 32. Your memory reference may be at the beginning, middle, or end of that cache line.
Once it's in the cache, the address is used as a "lookup" to find the cached data.
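To make that concrete, here's a minimal C sketch of the alignment arithmetic: given an address, it computes the base of the cache line the address falls in and the offset within that line. The 64-byte line size is an assumption for the example; real code should query the platform (on Linux, for instance, sysconf(_SC_LEVEL1_DCACHE_LINESIZE)) rather than hard-code it.

```c
/* A minimal sketch of the alignment arithmetic described above.
 * The 64-byte line size is an assumption (common on current hardware),
 * not something to hard-code in real code. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64  /* assumed line size in bytes */

int main(void)
{
    int data[100];
    uintptr_t addr = (uintptr_t)&data[10];

    /* The line base is the address rounded down to a line boundary... */
    uintptr_t line_base = addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    /* ...and the offset tells you where within that line the reference lands. */
    size_t offset = addr - line_base;

    printf("address %p sits %zu bytes into the 64-byte line starting at %p\n",
           (void *)addr, offset, (void *)line_base);
    return 0;
}
```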
Locality of reference will help you if the cache line is large enough that an "adjacent" item you reference happens to have been cached as part of the same line. Jumping through your array will defeat this.
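Here's a rough C illustration of that point. Both loops sum the same matrix, but the first walks memory sequentially, so each cache line fetched is fully used, while the second jumps a whole row between accesses and throws most of each line away. The matrix size and the crude clock()-based timing are arbitrary choices for the sketch.

```c
/* Rough illustration of locality: same work, different access order. */
#include <stdio.h>
#include <time.h>

#define N 4096

static int matrix[N][N];

int main(void)
{
    long sum = 0;
    clock_t t;

    /* Row-major order: consecutive accesses land in the same cache line. */
    t = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += matrix[i][j];
    printf("row-major:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    /* Column-major order: each access is N*sizeof(int) bytes past the last,
     * so almost every reference misses the line fetched for the previous one. */
    t = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += matrix[i][j];
    printf("column-major: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    return (int)(sum & 1);  /* keep the compiler from discarding the loops */
}
```

The column-major loop is typically several times slower, though the exact ratio depends entirely on the cache hierarchy it runs on.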
Cache designs vary widely based on vendor, product line, processor price, and a whole lot more. Perfect cache optimization is going to be highly elusive unless (1) you know a whole lot about the machine you're going to run on, and (2) you're really not interested in running on any other machine.
One other factor to consider is that 32-bit addresses are half the size of 64-bit addresses, and this has a significant effect on how much data can be cached: every pointer stored in your data structures takes twice the room on a 64-bit build, so fewer of them fit in a cache of a given size. Giving more bits to addresses means fewer bits left for data, more or less.
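A small sketch of that effect, using a hypothetical linked-list node; the exact sizes depend on the ABI's padding rules, but the doubling of the pointer is what matters:

```c
/* The same node layout is bigger when pointers are 8 bytes, so fewer
 * nodes fit per cache line and per cache. The 64-byte line is assumed. */
#include <stdio.h>

struct list_node {
    struct list_node *next;  /* 4 bytes on a 32-bit build, 8 on a 64-bit build */
    int value;
};

int main(void)
{
    printf("sizeof(void *)           = %zu\n", sizeof(void *));
    printf("sizeof(struct list_node) = %zu\n", sizeof(struct list_node));
    printf("nodes per 64-byte line   = %zu\n", 64 / sizeof(struct list_node));
    return 0;
}
```

On a typical 64-bit build the node above occupies 16 bytes (4 per 64-byte line); rebuilt for 32 bits it's 8 bytes (8 per line).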
Prefetching is more witchcraft than science. Fetching data from memory into cache is expensive, even when it's asynchronous from processor execution (although it can't ever be too far separated from execution). Locality of reference is a good rule, although it's going to be based on hardware architecture in ways that don't necessarily match code execution on the micro scale. LRU (least recently used) is a common method of deciding what to boot out of cache, but removing something from cache to make room for something that ends up never being used is not such a good optimization. So prefetching will be judicious, to say the least.
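If you do want to experiment with explicit prefetching, GCC and Clang expose it as __builtin_prefetch (compiler-specific, not standard C); the prefetch distance of 16 elements in the sketch below is a guess that would need tuning per machine:

```c
/* Hedged sketch of explicit prefetching with the GCC/Clang builtin. */
#include <stddef.h>

long sum_with_prefetch(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            /* Hints: read-only access (0), moderate temporal locality (1). */
            __builtin_prefetch(&data[i + 16], 0, 1);
        sum += data[i];
    }
    return sum;
}
```

For a simple linear scan like this the hardware prefetcher usually gets there first and the hint buys you nothing, which is rather the point about witchcraft.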
EDIT: virtual memory issues, task switching, etc.
Virtual memory certainly makes things far more interesting, especially in operating systems that support multiple address spaces. Cache is most likely to be based on real addresses, not virtual addresses, so things like page swaps can have interesting side effects on caching. Typically, a page that is due to be swapped out or released will first be invalidated and moved to a "flush list" (where it can be written to the swap file) or a "free list". Depending on the implementation, these pages can still be reclaimed by an application, but they're no longer addressable, meaning a page fault would occur in the process of reclaiming them. So once a page has been moved out of an app's working set, it's very likely that any cache lines associated with it will be invalidated. If the page isn't being heavily used, it's not likely to have much in cache either, but in a heavy swapping situation, cache performance can take a hit along with swapping performance.
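If paging of a hot buffer is a genuine worry, POSIX systems let you pin its pages in physical memory with mlock() so they can't be swapped out at all. Treat this as a sketch, not a recommendation: mlock is limited by RLIMIT_MEMLOCK, may need elevated privileges, and locking memory you don't truly need resident just shifts the pressure onto everything else.

```c
/* Sketch: pin a working buffer in RAM so its pages can't be paged out. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;               /* 1 MiB buffer, an arbitrary size */
    void *buf = malloc(len);
    if (buf == NULL)
        return 1;

    if (mlock(buf, len) != 0) {         /* pin the pages in physical memory */
        perror("mlock");
    } else {
        /* ... hot-path work on buf goes here ... */
        munlock(buf, len);
    }

    free(buf);
    return 0;
}
```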
Also, some cache designs have "shared" cache, and most or all have processor- and core-specific cache. Where cache is dedicated to a specific processor or core, and that core changes task, the entire cache is likely to be flushed to avoid corruption by a new process. This would not include thread switching, since threads run in the same process and the same address space. The real issue here is that high activity in other applications on a system can impact your cache performance. Shared cache alleviates this problem to some extent, but has to be more carefully managed to avoid corruption.
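One partial mitigation for the "other work on my core keeps evicting my data" problem is to pin cache-sensitive threads to particular cores. The sketch below uses the Linux-specific pthread_setaffinity_np(); other platforms have their own APIs, and pinning only helps if the chosen core isn't also being handed unrelated busy tasks.

```c
/* Sketch: pin a worker thread to one core (Linux-specific; build with -pthread). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;

    /* Pin this thread to core 2 -- an arbitrary choice for the sketch. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "could not set affinity\n");

    /* ... cache-sensitive work goes here ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, NULL);
    return 0;
}
```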