To expand on the previous answers a bit:
Usually, as programmers, we can think of our programs' addressable memory as a flat array of bytes, from 0x00000000 to 0xFFFFFFFF. The operating system will reserve some of those addresses (all the ones lower than 0x80000000, say) for its own use, but we can do what we like with the others. All of those memory locations live in the computer's RAM, and when we want to read from or write to them, we issue the appropriate instructions.
But this isn't true! There are a bunch of complications tainting that simple model of process memory: virtual memory, swapping, and the cache.
Talking to RAM takes a fairly long time. It's much faster than going to the hard disk, as there aren't any spinning platters or magnets involved, but it's still pretty slow by the standards of a modern CPU. So, when you try to read from a particular location in memory, your CPU doesn't just read that one location into a register and call it good. Instead, it reads that location, /and a bunch of nearby locations/, into a processor cache that lives on the CPU and can be accessed much more quickly than main memory.
Now we have a more complicated, but more correct, view of the computer's behavior. When we try to read a location in memory, first we look in the processor cache to see if the value at that location is already stored there. If it is, we use the value in the cache. If it isn't, we take a longer trip into main memory, retrieve the value as well as several of its neighbors and stick them in the cache, kicking out some of what used to be there to make room.
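To make that concrete, here's a small sketch (not from the previous answers; the 64-byte line size and the buffer size are assumptions about typical hardware) that touches every byte of a big buffer in order, and then touches just one byte per assumed cache line:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N (64 * 1024 * 1024)   /* 64 MiB: comfortably bigger than any cache */

    int main(void)
    {
        unsigned char *buf = malloc(N);
        if (!buf) return 1;
        memset(buf, 1, N);          /* make sure the pages actually exist */

        long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < N; i++)        /* sequential: each RAM trip also */
            sum += buf[i];                  /* caches the next ~63 bytes      */
        clock_t t1 = clock();
        for (long i = 0; i < N; i += 64)    /* strided: every read is likely  */
            sum += buf[i];                  /* a fresh trip to RAM            */
        clock_t t2 = clock();

        printf("sequential: %ld ticks, strided: %ld ticks (sum=%ld)\n",
               (long)(t1 - t0), (long)(t2 - t1), sum);
        free(buf);
        return 0;
    }

The strided loop does only 1/64th of the reads, yet it typically takes nowhere near 1/64th of the time: it still has to pull every cache line in from RAM, while the sequential loop gets 63 of every 64 reads served from the line that's already cached.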
Now we can see why the second code snippet is faster than the first. In the second example, we first access a[0], b[0], and c[0]. Each of those values is cached, along with their neighbors, say a[1..7], b[1..7], and c[1..7]. Then when we access a[1], b[1], and c[1], they're already in the cache and we can read them quickly. Eventually we get to a[8], and have to go to RAM again, but seven times out of eight we're using nice fast cache memory instead of clunky slow RAM memory.
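The actual code is in lbrandy's answer, but a hypothetical stand-in with the same access pattern (the function name and the names a, b, c, width, and height are placeholders) would look something like this:

    /* Hypothetical stand-in for the second (fast) snippet: the inner loop   */
    /* walks along a row, so consecutive iterations touch consecutive        */
    /* addresses and reuse the cache line the first access already fetched.  */
    void add_images_fast(const unsigned char *a, const unsigned char *b,
                         unsigned char *c, int width, int height)
    {
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                c[y * width + x] = a[y * width + x] + b[y * width + x];
    }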
(So why don't accesses to a, b, and c kick each other out of the cache? It's a bit complicated, but essentially the processor decides where to store a given value in the cache by its address, so three objects that aren't near each other spatially are unlikely to be cached into the same location.)
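To sketch that idea (the numbers below are made up, and real caches are set-associative with geometry that varies from CPU to CPU), the cache slot for an address might be picked roughly like this:

    #define LINE_SIZE 64     /* bytes per cache line (assumed) */
    #define NUM_SETS  512    /* number of cache sets (assumed) */

    /* Toy model: the low bits of an address choose the byte within a line, */
    /* the next bits choose the set.  Since a, b, and c start at unrelated  */
    /* addresses, their elements mostly land in different sets.             */
    unsigned cache_set(unsigned long addr)
    {
        return (addr / LINE_SIZE) % NUM_SETS;
    }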
By contrast, consider the first snippet from lbrandy's post. We first read a[0], b[0], and c[0], caching a[1..7], b[1..7], and c[1..7]. Then we access a[width], b[width], and c[width]. Assuming width is >= 8 (which it probably is, or else we wouldn't care about this sort of low-level optimization), we have to go to RAM again, caching a new set of values. By the time we get to a[1], it will probably have been kicked out of the cache to make room for something else. In the not-at-all-uncommon case of a trio of arrays that are larger than the processor cache, it's likely that /every single read/ will miss the cache, degrading performance enormously.
This has been a very high-level discussion of modern caching behavior. For something more in-depth and technical, this looks like a thorough yet readable treatment of the subject.