I'm building an application which will have dynamically allocated objects of type A, each with a dynamically allocated member (v), similar to the class below:

class A {
    int a;
    int b;
    int* v;
};

where:

  • The memory for v will be allocated in the constructor.
  • v will be allocated once when an object of type A is created and will never need to be resized.
  • The size of v will vary across all instances of A.

The application will potentially have a huge number of such objects and will mostly need to stream a large number of them through the CPU, but it only needs to perform very simple computations on the member variables.

  • Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?
  • What tools and techniques can be used to test whether this fragmentation is a performance bottleneck?
  • If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?
  • Or are there any techniques to aid memory access, such as a pre-fetching scheme? For example, fetch an object of type A and operate on the other member variables whilst pre-fetching v.
  • If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?

The target platforms are standard desktop machines with x86/AMD64 processors, Windows or Linux OSes and compiled using either GCC or MSVC compilers.

+1  A: 

Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?

Yes, that is likely.

What tools and techniques can be used to test if this fragmentation is a performance bottleneck?

Cachegrind (part of Valgrind) on Linux, or Shark on Mac OS X.
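For example, a minimal Cachegrind run looks like this (your_app is a placeholder for your binary; Cachegrind writes its counts to cachegrind.out.<pid>):

valgrind --tool=cachegrind ./your_app
cg_annotate cachegrind.out.<pid>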

If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?

Yes, you could allocate them together, but you should probably see whether it's an issue first. You could use arena allocation, for example (see the sketch below), or write your own allocators.
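A minimal sketch of the arena idea, assuming a bump-pointer scheme and an illustrative version of A whose constructor takes an externally allocated buffer (none of these names come from the question):

#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative stand-in for the question's class; the constructor takes
// a buffer carved out of the arena rather than allocating its own.
struct A {
    A(int* v_, std::size_t n_) : a(0), b(0), n(n_), v(v_) {}
    int a;
    int b;
    std::size_t n;
    int* v;
};

// Dead-simple bump allocator: everything is freed at once with the arena.
class Arena {
    char* base;
    std::size_t used, cap;
public:
    explicit Arena(std::size_t bytes)
        : base(static_cast<char*>(std::malloc(bytes))), used(0), cap(bytes) {}
    ~Arena() { std::free(base); }
    void* alloc(std::size_t bytes) {
        bytes = (bytes + 7) & ~std::size_t(7);  // keep 8-byte alignment
        if (!base || used + bytes > cap) return 0;
        void* p = base + used;
        used += bytes;
        return p;
    }
};

// An A and its v come from adjacent arena bytes, so they end up together.
A* make_a(Arena& arena, std::size_t n) {
    void* pa = arena.alloc(sizeof(A));
    void* pv = arena.alloc(n * sizeof(int));
    if (!pa || !pv) return 0;
    return new (pa) A(static_cast<int*>(pv), n);
}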

Or are there any techniques to aid memory access, such as a pre-fetching scheme? For example, fetch an object of type A and operate on the other member variables whilst pre-fetching v.

Yes, you could do this. The best thing to do would be to allocate regions of memory used together near each other.

If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?

It might or might not. It would at least make v contiguous with the other members; a sketch of that layout follows.
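Something like this, where max_length and the len field are illustrative rather than taken from the question:

#include <cstddef>

static const std::size_t max_length = 64;  // assumed compile-time bound

class A {
    int a;
    int b;
    std::size_t len;    // how many entries of v are actually in use
    int v[max_length];  // stored inline, contiguous with a and b
};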

  1. Write code.
  2. Profile.
  3. Optimize.
WhirlWind
+2  A: 

If you have a good reason to care about performance...

Could having v dynamically allocated mean that an instance of A and its member v are not located together in memory?

If they are both allocated with 'new', then it is likely that they will be near one another. However, the current state of memory can drastically affect this outcome; it depends significantly on what you've been doing with memory. If you just allocate a thousand of these things one after another, then the later ones will almost certainly be "nearly contiguous".

If the A instance is on the stack, it is highly unlikely that its 'v' will be nearby.

If such fragmentation is a performance issue, are there any techniques that could allow A and v to be allocated in a contiguous region of memory?

Allocate space for both, then placement new them into that space. It's dirty, but it should typically work:

// Needs <new> for placement new and <cstdlib> for malloc/free. Note that
// sizeof(A::v) would only measure the pointer itself, so instead allocate
// room for the n ints that v should point at:
char* p = static_cast<char*>(malloc(sizeof(A) + n * sizeof(int)));
int* v = reinterpret_cast<int*>(p + sizeof(A));
A* a = new (p) A(v);  // assumes A has a constructor taking an int*

// time passes

a->~A();  // constructed with placement new, so destroy explicitly...
free(p);  // ...and release the raw block; never call delete here

Or are there any techniques to aid memory access, such as a pre-fetching scheme?

Prefetching is compiler- and platform-specific, but many compilers have intrinsics available to do it. Mind, it won't help a lot if you're going to access that data right away; for prefetching to be of any value you often need to issue it hundreds of cycles before you want the data. That said, it can be a huge boost to speed. The intrinsic would look something like __pf(my_a->v);
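For the x86/AMD64 targets named in the question, one spelling that works in both GCC and MSVC is the SSE intrinsic _mm_prefetch (GCC alternatively offers __builtin_prefetch). A rough sketch, where the A layout and do_simple_work are placeholders for the real structure and computation:

#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch: available in GCC and MSVC on x86

// Placeholder stand-ins for the question's class and per-object work.
struct A { int a; int b; int* v; };
void do_simple_work(A* obj) { obj->a += obj->b; }

void process(A** objects, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        // Start pulling the next object's v into all cache levels while
        // working on the current one; in practice you may need to run
        // several iterations ahead for the prefetch to arrive in time.
        if (i + 1 < count)
            _mm_prefetch(reinterpret_cast<const char*>(objects[i + 1]->v),
                         _MM_HINT_T0);
        do_simple_work(objects[i]);
    }
}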

If the size of v, or an acceptable maximum size, could be known at compile time, would replacing v with a fixed-size array like int v[max_length] lead to better performance?

Maybe. If the fixed-size buffer is usually close to the size you'll need, then it could be a huge boost in speed. It will always be faster to access one A instance in this way, but if the buffer is unnecessarily gigantic and largely unused, you'll lose the opportunity to fit more objects into the cache. I.e. it's better to have many smaller objects in the cache than to have a lot of unused data filling the cache up.

The specifics depend on what your design and performance goals are. For an interesting discussion about this, with a "real-world" specific problem on a specific bit of hardware with a specific compiler, see The Pitfalls of Object Oriented Programming (that's a Google Docs link for a PDF; the PDF itself can be found here).

dash-tom-bang
Neat. I didn't know about placement new.
WhirlWind
"If they are both allocated with 'new', then it is likely that they will be near one another" - nope. If `A` is 16 bytes, and `v` is 512kB then it's highly unlikely that they will be near one another. And -1 for suggesting dirty hacks without knowing if the OP really needs it.
Kornel Kisielewicz
On the other hand, if sizeof(v) = 512 KB, then it isn't that important whether the 16 bytes of A are contiguous with it. And co-allocating space is not a dirty hack; typically you'd wrap it in a static class member (public named ctor idiom).
MSalters
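A minimal sketch of that named-constructor packaging (the create/destroy names and the size parameter are illustrative):

#include <cstddef>
#include <cstdlib>
#include <new>

class A {
    int a, b;
    int* v;
    A(int* v_) : a(0), b(0), v(v_) {}  // private: force use of create()
public:
    static A* create(std::size_t n) {
        // One malloc covers the object and its n ints.
        char* p = static_cast<char*>(std::malloc(sizeof(A) + n * sizeof(int)));
        if (!p) return 0;
        return new (p) A(reinterpret_cast<int*>(p + sizeof(A)));
    }
    static void destroy(A* obj) {
        if (obj) { obj->~A(); std::free(obj); }
    }
};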
@Kornel - without more information from the OP, how do we *know* what the OP is looking for? Suggesting options in the face of incomplete information is a major practice on SO, even more so when the OP specifically asks for "advanced methods". Sorry you feel hurt by my comment on your post, but the only wrong statement regarding programming is the absolute one.
dash-tom-bang
A: 

If you need to stream a large number of these through the CPU and do very little calculation on each one, as you say, why are we doing all this memory allocation?

Could you just have one copy of the structure, and one (big) buffer for v: read your data into it (in binary, for speed), do your very little calculation, and move on to the next one?
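A sketch of that approach, assuming a simple binary record format (a, b, a length, then that many ints); the real format will differ:

#include <cstddef>
#include <cstdio>
#include <vector>

// Assumed on-disk layout: a, b, length, then `length` ints for v.
struct Record { int a; int b; int length; };

void stream_file(std::FILE* f) {
    Record r;
    std::vector<int> v;  // one buffer, reused for every record's v
    while (std::fread(&r, sizeof r, 1, f) == 1) {
        v.resize(r.length);
        if (r.length > 0 &&
            std::fread(&v[0], sizeof(int), r.length, f)
                != static_cast<std::size_t>(r.length))
            break;  // truncated input
        // ...do the very small computation on r and v here...
    }
}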

The program should spend almost 100% of its time in I/O. If you pause it several times while it's running, you should see it almost every time in the process of calling a system routine like FileRead. Some profilers might give you this information, except they tend to be allergic to I/O time.

Mike Dunlavey