I have an implementation of a class X, that has two pointers to two pieces of information. I have written a new implementation, class Y, that has only one pointer to a struct that contains the two pieces of information together as adjacent members. X's and Y's methods usually only need to manipulate one of the pieces of information, but provide a get() method that returns a pointer to the second piece (in this case class X just returns its pointer to that piece and class Y returns the address of the struct's second member). In normal usage, calls to X's and Y's methods will happen interspersed by calls to get() and doing work on that returned second piece.

I expect that in real life situations there should be a performance improvement, now that the two pieces of information are next to one another in memory in the class Y implementation (because they are adjacent members of a struct), but I'm not seeing any difference in the benchmarks I've written (interspersing calls to X's and Y's methods with doing work on their second pieces in big loops). I suspect this is because everything fits in cache in either case in my tests. I don't want to try this in my real app yet because the semantics of X and Y differ in other subtle ways not related to this optimization and porting the using application will be some work, and these benchmarks are supposed to help justify doing that work in the first place.

What's the best way to observe the difference in performance due to better cache locality? If I do a bunch of dummy work on an array equal to the size of the cache in between calls is that sufficient? Or do I want to do work on an array slightly less than the cache size, so that work on my instances of my class will cause things to fall in and out of cache? I'm not sure how to code something that is robust against compiler optimizations and different cache sizes.

A: 

If I'm understanding your situation correctly (and please correct me if not), then it's six of one, half a dozen of the other.

In class X, you need one pointer lookup for either piece of information. In class Y, you need one lookup for the first, and two (get the first and then offset) for the second. That's sacrificing "locality" for another memory access. Compilers are still, unfortunately, very good at wasting bus time looking up words in RAM.

If it's possible, you'll get the best results by holding the two pieces of target information directly within the class in question (i.e. each as its own class member), rather than using those pointers for unnecessary indirection. Not seeing any code, that's pretty much all I can say.
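Since no code was posted, here is a minimal sketch of the three layouts being discussed, just to make the comparison concrete. The names `First`, `Second`, `touch()` and `count()` are hypothetical stand-ins for the asker's two pieces of information and their methods; only `get()` comes from the question. It assumes C++14 for `std::make_unique`.

```cpp
#include <memory>

// Hypothetical stand-ins for the two pieces of information.
struct First  { long count = 0; };
struct Second { int payload = 0; };

// X-style: two pointers, each piece lives in its own heap block.
class X {
    std::unique_ptr<First>  first_  = std::make_unique<First>();
    std::unique_ptr<Second> second_ = std::make_unique<Second>();
public:
    void touch() { ++first_->count; }          // works on the first piece only
    long count() const { return first_->count; }
    Second* get() { return second_.get(); }    // returns the stored pointer
};

// Y-style: one pointer to a struct holding both pieces as adjacent members.
class Y {
    struct Both { First first; Second second; };
    std::unique_ptr<Both> both_ = std::make_unique<Both>();
public:
    void touch() { ++both_->first.count; }
    long count() const { return both_->first.count; }
    Second* get() { return &both_->second; }   // address of the second member
};

// Inline: both pieces are direct members, so there is no heap block
// and no pointer to follow. Only possible if nothing else shares the pieces.
class Inline {
    First  first_;
    Second second_;
public:
    void touch() { ++first_.count; }
    long count() const { return first_.count; }
    Second* get() { return &second_; }
};
```

All three expose the same interface, so a test loop that calls `touch()` and then works through `get()` can be compiled against each one unchanged.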

At any rate, you'll get a lot more performance out of studying the algorithmic complexity of your application than you ever will from micro-optimizing two variables in a class definition. It's also a great idea to use a profiling tool to see (objectively) where your bottlenecks are (gprof is common on *nix systems). Is there a distinct reason you're looking to improve cache locality specifically?

Chris
'Why' isn't really the issue here - the question is quite clearly about how to benchmark cache locality. I don't think 'why' really adds anything to the discussion, and it's best to assume Joseph knows what he is doing.
Justicle
The "why" is always important, at least IMHO. "I expect that in real life situations there should be a performance improvement" which tells me Joseph is looking to speed things up. "I don't want to try this in my real app yet" which suggests even more heavily that his end goal is better performance, and he's trying to go about it via improved locality - which is why I recommended other courses to improved performance. However, @Joseph, if I went the wrong direction here, please disregard. ;-) [And in that case, cachegrind is what you want]
Chris
I'm writing a smart pointer class that is basically algorithm-less. I've optimized it with gprof down to the point where things like whether a branch exists (an if) or a spurious integer assignment can determine whether my class beats the old implementation. This is one of the few instances where micro-optimizations definitely apply ;)
Joseph Garvin
+7  A: 

If you are on Linux, then using Cachegrind in conjunction with KCacheGrind might provide more insight into how your cache is behaving.

Soo Wei Tan
+1  A: 

You could design a benchmark specifically to bust the cache. For instance, allocate the pointed-to data blocks such that they're all guaranteed to be on different cache lines (say, by using a custom memory allocator that pads allocations out to at least a few hundred bytes). Then repeatedly iterate over a number of objects too big to fit everything in even the L2 cache (very platform-dependent, since it depends on the number of lines in cache, but 1 million would cover most architectures and only require a few hundred MB of RAM total).
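A rough sketch of such a cache-busting benchmark might look like the following. Everything in it is an assumption for illustration: the 256-byte block size, the names `Piece`, `Pair`, `ObjX`, `ObjY` and `run_benchmark`, and the use of padded structs in plain vectors instead of a real custom allocator. Objects point at shuffled blocks so that walking the objects in order jumps around memory, and each sweep returns a checksum so the compiler cannot discard the work.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Blocks are padded to 256 bytes, so two blocks never share a cache line.
struct Piece { long value = 0; char pad[256 - sizeof(long)]; };        // X-style: one piece per block
struct Pair  { long first = 0; long second = 0;
               char pad[256 - 2 * sizeof(long)]; };                    // Y-style: both pieces adjacent
static_assert(sizeof(Piece) == 256 && sizeof(Pair) == 256, "unexpected padding");

struct ObjX { Piece* first; Piece* second; };   // two pointers, two blocks
struct ObjY { Pair* both; };                    // one pointer, one block

struct Result { double x_seconds; double y_seconds; long x_sum; long y_sum; };

// Returns 0..n-1 in a random order (fixed seed for repeatable runs).
std::vector<std::size_t> shuffled_indices(std::size_t n, unsigned seed) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(idx.begin(), idx.end(), rng);
    return idx;
}

template <class F>
double time_seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Touch both pieces of every object; the checksum keeps the loop alive.
long sweep_x(const std::vector<ObjX>& objs, int rounds) {
    long sum = 0;
    for (int r = 0; r < rounds; ++r)
        for (const ObjX& o : objs) {
            o.first->value += 1;
            o.second->value += 2;
            sum += o.first->value + o.second->value;
        }
    return sum;
}

long sweep_y(const std::vector<ObjY>& objs, int rounds) {
    long sum = 0;
    for (int r = 0; r < rounds; ++r)
        for (const ObjY& o : objs) {
            o.both->first += 1;
            o.both->second += 2;
            sum += o.both->first + o.both->second;
        }
    return sum;
}

Result run_benchmark(std::size_t n, int rounds) {
    std::vector<Piece> first_pieces(n), second_pieces(n);
    std::vector<Pair> pairs(n);
    const auto a = shuffled_indices(n, 1), b = shuffled_indices(n, 2);
    std::vector<ObjX> xs(n);
    std::vector<ObjY> ys(n);
    for (std::size_t i = 0; i < n; ++i) {
        xs[i] = ObjX{&first_pieces[a[i]], &second_pieces[b[i]]};
        ys[i] = ObjY{&pairs[a[i]]};
    }
    Result res{};
    res.x_seconds = time_seconds([&] { res.x_sum = sweep_x(xs, rounds); });
    res.y_seconds = time_seconds([&] { res.y_sum = sweep_y(ys, rounds); });
    return res;
}
```

Calling `run_benchmark(1000000, 10)` and printing the two timings and checksums would exercise roughly 768 MB of blocks, far more than any cache level holds. The two checksums should be equal, which is a cheap check that both sweeps did the same work.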

This will give you an upper limit on the performance gain made by the change from X to Y. But it does it by degrading the performance of X down to below any likely real-world usage. And to prove your case you need a lower-limit estimate, not an upper-limit estimate. So I'm not sure you'd achieve much, unless you discover that even this worst case still makes no significant difference and you needn't bother with the optimization.

Even if you don't aim for theoretical worst-case performance of X, any benchmark designed to exceed the cache is just picking an arbitrary point of bad performance of X, and looking to see if Y is better. It's not far off rigging the benchmark to make Y look good. It really doesn't matter how your code performs in dodgy benchmarks, except maybe for the purposes of marketing lies, er, literature.

The best way to observe the real-world difference in performance is to measure a real-world client of your class. You say that "the semantics of X and Y differ in other subtle ways not related to this optimization", in which case I can only recommend that you write a class Z which differs from X only in respect of this optimization, and use that in your application as the comparison.

Once your tests attempt to represent the worst realistic use, then if you aren't seeing any difference in performance there's probably no performance gain to be had.

All that said, if it makes logical sense (that is, it doesn't make the code any more astonishing), then I would advocate minimising the number of heap allocations in C++ simply as a rule of thumb. It doesn't tend to make speed or total memory usage worse, and it does tend to simplify your resource handling. A rule of thumb doesn't justify a re-write of working code, of course.

Steve Jessop