Hi all,

I'm trying to use a faster memory allocator in C++. I can't use Hoard due to licensing / cost. I was using NEDMalloc in a single-threaded setting and got excellent performance, but I'm wondering if I should switch to something else -- as I understand things, NEDMalloc is just a replacement for the C functions malloc() & free(), not for the C++ new & delete operators (which I use extensively).

The problem is that I now need to be thread-safe, so I'm trying to malloc an object which is reference counted (to prevent excess copying), but which also contains a pointer to a mutex. That way, if you're about to delete the last copy, you first need to lock the mutex, then free the object, and lastly unlock & free the mutex.

However, using malloc to create a boost::mutex appears impossible because I can't initialize the private object, as calling the constructor directly is forbidden.

So I'm left with this odd situation, where I'm using new to allocate the lock and nedmalloc to allocate everything else. But when I allocate a large amount of memory, I run into allocation errors (which disappear when I switch to malloc instead of nedmalloc, but then the performance is terrible). My guess is that this is due to fragmentation in the memory and an inability of nedmalloc and new to play nice side by side.

There has to be a better solution. What would you suggest?

+2  A: 

Have you profiled and verified that actual memory allocation is a significant enough problem that replacing the allocator provides useful gain?

Is NEDMalloc thread safe?

Often, the default C++ new/delete operators will use malloc and free under the hood to do the actual memory allocation before/after calling the constructor/destructor. If they don't in your particular situation, you can override the global new and delete operators to call whatever allocation implementation you wish. This requires some care to make sure that memory is always deallocated by the same allocator that allocated it (especially when dealing with libraries).
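A minimal sketch of what overriding the global operators might look like. The names my_alloc and my_free are hypothetical stand-ins (here they just forward to std::malloc and std::free); in a real build they would forward to whichever allocator you picked. The counters are only there to show that the replacements are actually being called.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-ins for the replacement allocator's entry points.
// Assumption: swap these bodies for calls into the allocator you chose.
static void* my_alloc(std::size_t n) { return std::malloc(n); }
static void  my_free(void* p)        { std::free(p); }

// Illustration only: count calls that reach the replaced operators.
static std::atomic<std::size_t> g_news(0);
static std::atomic<std::size_t> g_deletes(0);

// Replacing the global operators: every new/delete expression in the
// program now goes through my_alloc/my_free.
void* operator new(std::size_t n) {
    if (n == 0) n = 1;                 // must still return a distinct pointer
    if (void* p = my_alloc(n)) {
        ++g_news;
        return p;
    }
    throw std::bad_alloc();            // reporting failure is our job now
}

void operator delete(void* p) noexcept {
    if (p) {                           // deleting a null pointer is a no-op
        ++g_deletes;
        my_free(p);
    }
}

// Array forms forward to the scalar forms so everything uses one allocator.
void* operator new[](std::size_t n)      { return operator new(n); }
void operator delete[](void* p) noexcept { operator delete(p); }
```

Defining new and delete as a pair is what keeps allocation and deallocation on the same allocator. Memory handed out by a library that was built against a different allocator still has to be returned to that library.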

Mark B
M. Tibbits
+3  A: 

Google's malloc replacement (tcmalloc) is quite fast, thread safe by default, and easy to use. Simply link it into your application and it will replace the behavior of malloc/free and new/delete. This makes it particularly easy to re-profile your app to verify the new allocator is actually speeding things up.

caspin
+1  A: 

Well, usually the C++ new and delete operators internally call the plain C library functions malloc and free (plus some additional magic like calling ctors and dtors), so providing a custom implementation of these functions may be enough (this is not infrequent in embedded C++ development, but requires some linker-level work). What system and what compiler are you targeting?

Lorenzo
M. Tibbits
Just had to follow up -- love tcmalloc. It wasn't working until recently with gcc 4.5, but both were just patched and things are singing now. The only build I don't have working is the 32 bit build on a 64 bit native Win 7 in VS 2010 -- but that may not be tcmalloc's fault. Update: never mind, CUDA 3.2 Beta wasn't playing nice in ways I didn't suspect. Tcmalloc was never to blame.
M. Tibbits
+2  A: 

You can overload global operators new and delete to call the new versions of malloc and free that you're using. This should make things play nicer together, though I'd be surprised if this wasn't happening already.

As for creating the mutex, use placement new -- this is how a constructor is called manually. A static array of char will do by way of buffer. For example, globals:

static char buf[sizeof(Mutex)];
static Mutex *m=0;

Then to initialize the m pointer:

m=new(buf) Mutex;

(You can also align the pointer, and so on, if you need to, and rename the variables, and so on.)
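To make the alignment remark concrete, here is a sketch of the same idea with the buffer aligned for the mutex type. It assumes C++11, and uses std::mutex as a stand-in for Mutex; init_mutex is a hypothetical helper name, not something from any library.

```cpp
#include <mutex>
#include <new>

// Assumption: std::mutex stands in for the Mutex type used above.
typedef std::mutex Mutex;

namespace {
// Raw storage with the size and alignment of a Mutex.
// No constructor runs here; this is just bytes.
alignas(Mutex) unsigned char buf[sizeof(Mutex)];
Mutex* m = nullptr;
}

// Hypothetical helper: construct the mutex in place on first call.
// The check itself is not thread-safe, so call this once during
// single-threaded start-up, before other threads exist.
Mutex* init_mutex() {
    if (!m) {
        m = new (buf) Mutex;   // placement new: runs the constructor only
    }
    return m;
}
```

Placement new itself does not allocate, so this works even while the replacement allocator is still being set up.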

One thing that might be worth noting is that if the Mutex constructor does more memory allocation itself then this can be a problem. This is unlikely, but possible. (For this likely-to-be-rare case, there's usually no problem with an ad-hoc implementation of a cross-platform mutex wrapper, that doesn't do any allocation -- or, though it will end up a mess eventually, just use #ifdef and use the platform types directly. In either case, it's not much code, and anybody experienced with the system(s) in question can create the relevant code, bug-free, in very little time.)

Correct cleanup of objects created this way can be difficult, so I recommend not to bother (no, seriously). It's perfectly OK to let this stuff leak when you're using it to implement the memory manager; no point going mad over it. (If you're working on a system that has a notion of process exit, the OS is pretty much guaranteed to clean up the underlying mutex for you.)

brone
But this has the side effect (it might be a wanted side effect) that you have to provide a new overload for each class you want to manage with the custom allocator.
Lorenzo
There's no custom allocator; placement new is standard, and global operators new and delete are being overridden (if that is even necessary). I did think of one issue though and will edit the answer accordingly.
brone
I can't mark two solutions. I'm going to try this after the tcmalloc. Thanks!
M. Tibbits