views: 144
answers: 6

I've been reading Ulrich Drepper's "What every programmer should know about memory", and section 3.3.2 Measurements of Cache Effects (halfway down the page) gives me the impression that accessing any member of a struct causes the whole struct to get pulled into the CPU cache.

Is this correct? If so, how does the hardware know about the layout of these structs? Or does the code generated by the compiler somehow force the entire struct to be loaded?

Or are the slowdowns from using larger structs primarily due to TLB misses caused by the structs being spread out across more memory pages?

The example struct used by Drepper is:

  struct l {
    struct l *n;
    long int pad[NPAD];
  };

Where sizeof(struct l) is determined by NPAD: with NPAD equal to 0, 7, 15, or 31, the list elements end up 0, 56, 120, and 248 bytes apart, assuming 64-byte cache lines and 4k pages.

Just iterating through the linked list gets significantly slower as the struct grows, even though nothing other than the pointer is actually being accessed.
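
For reference, here is a minimal pointer-chasing loop in the spirit of Drepper's sequential test (my own sketch, not his actual harness). NPAD is fixed at 7 here, which gives 64-byte nodes on a typical 64-bit system:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NPAD   7                 /* the paper varies this: 0, 7, 15, 31 */
#define NELEMS (1u << 20)        /* 1M nodes, far larger than any cache */

struct l {
  struct l *n;
  long int pad[NPAD];
};

int main(void)
{
  struct l *nodes = malloc(NELEMS * sizeof *nodes);
  if (!nodes)
    return 1;

  /* Build a simple sequential list: node i points to node i+1. */
  for (size_t i = 0; i + 1 < NELEMS; i++)
    nodes[i].n = &nodes[i + 1];
  nodes[NELEMS - 1].n = NULL;

  /* Walk the list; nothing but the pointer member is ever touched. */
  clock_t start = clock();
  size_t count = 0;
  for (struct l *p = nodes; p != NULL; p = p->n)
    count++;
  clock_t end = clock();

  printf("visited %zu nodes in %.1f ms\n",
         count, 1000.0 * (end - start) / CLOCKS_PER_SEC);
  free(nodes);
  return 0;
}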

+6  A: 

The hardware doesn't know the layout of the struct, but just loads a number of bytes around the accessed member into cache. And yes, the slowdown from larger structs is because they will then be spread across more cache lines.
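
One way to picture it: the cache works in units of aligned lines, so the extra bytes that come along with an access are simply the rest of the line the accessed address falls into. A small sketch of that arithmetic, assuming a 64-byte line size:

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64   /* assumed here; the real value depends on the CPU */

/* Start address of the (assumed 64-byte) cache line containing addr. */
static uintptr_t line_base(const void *addr)
{
  return (uintptr_t)addr & ~(uintptr_t)(LINE_SIZE - 1);
}

struct example {
  long a;
  long pad[8];
  long b;          /* 72 bytes past a, so never in the same 64-byte line */
};

int main(void)
{
  struct example e;
  printf("a and b share a cache line: %s\n",
         line_base(&e.a) == line_base(&e.b) ? "yes" : "no");
  return 0;
}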

Thomas Padron-McCarthy
This is exactly right. The fancy concept here is locality of reference.
Jason
So when you say "spread across more cache lines", you mean that part of the slowdown results from fetching unused portions of the surrounding struct, in addition to TLB misses.
Robert S. Barnes
@Robert: I think it could apply in two different ways: 1. A single struct is so large that it doesn't fit comfortably in a single cache line. If you "touched it all over" (sounds dirty) it would probably cause multiple line fetches. 2. With a larger struct you simply get fewer structs into cache with any single cache line read, thereby increasing the probability that the *next* struct you want isn't in cache. A fundamental problem here is *locality of reference*. If you hopscotch around memory, you're going to have more cache misses. Understand your access pattern and design accordingly.
Peter Rowell
+1  A: 

Usually, the L1 cache uses virtual addresses. If you access a member of a struct, a specific number of bytes is brought into the cache (one cache line, usually between 8 and 512 bytes in size). Since all struct members are laid out side by side in memory, the chance that the whole structure ends up in the cache is fairly high (it depends on sizeof(struct your_struct))...

Johannes Weiß
+3  A: 

Accessing a struct member causes no more of a performance penalty than accessing any other area in memory. In fact, there may be a performance improvement if you access several struct members in the same area, since other members might be cached by the first access.
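
A minimal illustration of that improvement (hypothetical struct and function names): once p->x has been read and its cache line is resident, the read of the adjacent p->y is essentially free:

struct point {
  double x;
  double y;        /* adjacent to x, so almost certainly in the same cache line */
};

/* The load of p->x brings the whole line in; p->y then hits the cache. */
double sum_coords(const struct point *p)
{
  return p->x + p->y;
}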

Richard Pennington
+4  A: 

The hardware does not know anything about the struct. But it is true that the hardware loads into the cache some bytes around the bytes you are actually accessing. This is because the cache works in whole cache lines: it does not fetch byte by byte but, e.g., 16 bytes at a time.

You have to be careful when ordering the members of the struct so that often-used members are close to each other. For instance, if you have the following struct:

struct S {
  int foo;
  char name[64];
  int bar;
};

If the member variables foo and bar are used very often, the hardware will load into the cache the bytes around foo, and when you access bar, it will have to load the bytes around bar as well, even if the bytes around foo and around bar are never used. Now rewrite your struct as follows:

struct S {
  int foo;
  int bar;
  char name[64];
};

When you use foo, the hardware will load into the cache the bytes around foo. When you use bar, it will already be in the cache because it is contained in the bytes around foo, so the CPU won't have to wait for it.

The answer is: accessing a single struct member does not pull the entire struct into the cache, but it does pull some other members of the struct in along with it.
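
If you want to see how a given layout falls out, offsetof makes the distances visible. A small check of the two layouts above (assuming 64-byte cache lines, so members 64 or more bytes apart can never share a line):

#include <stddef.h>
#include <stdio.h>

struct S_orig      { int foo; char name[64]; int bar; };
struct S_reordered { int foo; int bar; char name[64]; };

int main(void)
{
  /* Original layout: bar typically sits 68 bytes after foo (4-byte int),
   * so the two can never share a 64-byte cache line.  Reordered: they
   * are 4 bytes apart and will normally land in the same line. */
  printf("original:  foo at %zu, bar at %zu\n",
         offsetof(struct S_orig, foo), offsetof(struct S_orig, bar));
  printf("reordered: foo at %zu, bar at %zu\n",
         offsetof(struct S_reordered, foo), offsetof(struct S_reordered, bar));
  return 0;
}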

Didier Trosset
+1  A: 

While the CPU can happily deal with loads and stores as small as one byte, caches only ever deal with "cacheline" sized data. In computer architecture textbooks this is also known as the "block size."

On most systems this is 32 or 64 bytes. It can be different from one CPU to the next, and even sometimes from one cache level to the next.

Additionally, some CPUs perform speculative prefetching; this means if you access cachelines 5 and 6 in sequence, it will attempt to load cacheline 7 without you asking for it.
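
If you want the actual figure for the machine you're on, glibc on Linux exposes it through sysconf (other platforms need other mechanisms). A small sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
  /* glibc-specific name; returns -1 or 0 if the value isn't available. */
  long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  if (line > 0)
    printf("L1 data cache line size: %ld bytes\n", line);
  else
    printf("L1 data cache line size not reported\n");
  return 0;
}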

Eric Seppanen
+1  A: 

"Just iterating through the linked list gets significantly slower as the struct grows, even though nothing other than the pointer is actually being accessed."

With NPAD = 0, each node is just a pointer (8 bytes here), so each 64-byte cache line contains 8 list nodes; you can see why that's fastest.

With NPAD = 7, 15, 31, only one cache line needs to be loaded for each list node, and you might expect them all to be the same speed - one cache miss per node. But a modern memory subsystem will be doing speculative prefetching. If it has spare capacity (which it probably does, because modern memory can serve multiple reads to main memory in parallel), then it will start loading memory close to the memory you're using. Although it's a linked list, if you constructed it in any of the obvious ways then there is a good chance that you're accessing memory in sequence. So the closer together in memory your list nodes are, the more successful the cache is likely to be at already having what you need.
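
To make "constructed it in any of the obvious ways" concrete: if the nodes come out of one big allocation and are linked in index order, the traversal walks memory sequentially and the speculative loading can keep up; linking the same nodes in a shuffled order defeats it. A sketch (reusing the struct l from the question):

#include <stdlib.h>

#define NPAD 7
struct l {
  struct l *n;
  long int pad[NPAD];
};

/* Link the nodes of one contiguous block in index order: the traversal
 * then reads memory sequentially, which prefetching handles well. */
void link_sequential(struct l *nodes, size_t n)
{
  if (n == 0)
    return;
  for (size_t i = 0; i + 1 < n; i++)
    nodes[i].n = &nodes[i + 1];
  nodes[n - 1].n = NULL;
}

/* Link the same nodes in a random permutation: each traversal step then
 * jumps somewhere the prefetcher could not have guessed. */
void link_shuffled(struct l *nodes, size_t n)
{
  if (n == 0)
    return;
  size_t *order = malloc(n * sizeof *order);
  if (!order)
    return;
  for (size_t i = 0; i < n; i++)
    order[i] = i;
  for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
    size_t j = (size_t)rand() % (i + 1);
    size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
  }
  for (size_t i = 0; i + 1 < n; i++)
    nodes[order[i]].n = &nodes[order[i + 1]];
  nodes[order[n - 1]].n = NULL;
  free(order);
}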

In the worst possible scenario, where your memory is being pulled in from swap as you use it, your program would be limited by disk I/O. It's possible that your rate of progress through the list would be entirely determined by how many nodes there are per page, and you might see time taken being directly proportional to the size of the node, up to 4k. I haven't tried it, though, and the OS will be clever with swap just as the MMU is clever with main memory, so it's not necessarily that simple.

Steve Jessop