  1. In the application we have about 30 types of objects that are created repeatedly.
  2. Some of them are long-lived (hours), some short-lived (milliseconds).
  3. Objects can be created in one thread and destroyed in another.

Does anybody have a clue what a good pooling technique would be, in the sense of minimal creation/destruction latency, low lock contention, and reasonable memory utilization?

Append 1.

1.1. Object pool/memory allocations for one type are usually not related to those of another type (see 1.3 for an exception).

1.2. Memory allocation is performed for only one type (class) at a time, usually for several objects at a time.

1.3. If a type aggregates another type via a pointer (for some reason), these types are allocated together in one contiguous piece of memory.

Append 2.

2.1. Using a collection with access serialization per type is known to perform worse than plain new/delete.

2.2. Application is used on different platforms/compilers and cannot use compiler/platform specific tricks.

Append 3.

It becomes obvious that the fastest (lowest-latency) implementation should organize object pooling as a star-like network of factories, where a central factory is global to the thread-specific factories. Regular object provision/recycling is more efficient to do in a thread-specific factory, while the central factory can be used for balancing objects between threads.

3.1. What is the most effective way to organize communications between the central factory and thread specific factories?

+1  A: 

If you haven't looked at tcmalloc, you might want to take a look. Basing your implementation off of its concepts might be a good start. Key points:

  • Determine a set of size classes. (Each allocation will be fulfilled by using an entry from an equal or greater sized allocation.)
  • Use one size-class per page. (All instances in a page are the same size.)
  • Use per-thread freelists to avoid atomic operations on every alloc/dealloc.
  • When a per-thread freelist is too large, move some of the instances back to the central freelist. Try to move back allocations from the same page.
  • When a per-thread freelist is empty, take some from the central freelist. Try to take contiguous entries.
  • Important: You probably know this, but make sure your design will minimize false sharing.
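A minimal sketch of the per-thread/central freelist split described above (all names — `CentralFreeList`, `ThreadCache`, the batch sizes — are illustrative, not tcmalloc's actual internals; in a real allocator the `Node` header lives inside the freed memory itself):

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// A freelist entry; the chain is threaded through the free objects.
struct Node { Node* next; };

// Central freelist shared by all threads; every operation takes the lock,
// so threads should only touch it in batches.
class CentralFreeList {
public:
    // Accept a whole chain of nodes back from an overflowing thread cache.
    void Release(Node* head, Node* tail) {
        std::lock_guard<std::mutex> g(mu_);
        tail->next = head_;
        head_ = head;
    }
    // Detach up to n nodes as one chain; returns nullptr if empty.
    Node* Take(std::size_t n) {
        std::lock_guard<std::mutex> g(mu_);
        Node* head = head_;
        Node* cur = head;
        for (std::size_t i = 1; cur != nullptr && i < n; ++i) cur = cur->next;
        if (cur != nullptr) { head_ = cur->next; cur->next = nullptr; }
        else head_ = nullptr;
        return head;
    }
private:
    std::mutex mu_;
    Node* head_ = nullptr;
};

// Per-thread cache: Pop/Push touch only thread-local state, except when
// refilling from or overflowing to the central list in batches.
class ThreadCache {
public:
    explicit ThreadCache(CentralFreeList* central) : central_(central) {}
    Node* Pop() {
        if (head_ == nullptr) {            // empty: refill in one batch
            head_ = central_->Take(kBatch);
            size_ = Count(head_);
        }
        Node* n = head_;
        if (n != nullptr) { head_ = n->next; --size_; }
        return n;                          // nullptr => caller allocates fresh
    }
    void Push(Node* n) {
        n->next = head_; head_ = n; ++size_;
        if (size_ > kMax) Overflow();      // too large: return a batch
    }
private:
    static const std::size_t kMax = 64;
    static const std::size_t kBatch = 32;
    static std::size_t Count(Node* n) {
        std::size_t c = 0;
        while (n != nullptr) { ++c; n = n->next; }
        return c;
    }
    void Overflow() {
        Node* head = head_;
        Node* tail = head;
        for (std::size_t i = 1; i < kBatch; ++i) tail = tail->next;
        head_ = tail->next; tail->next = nullptr; size_ -= kBatch;
        central_->Release(head, tail);     // one locked call per batch
    }
    CentralFreeList* central_;
    Node* head_ = nullptr;
    std::size_t size_ = 0;
};
```

In a real allocator each thread would own its `ThreadCache` via thread-local storage, and `Take`/`Release` would additionally try to keep entries from the same page together, as the bullet points suggest.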

Additional things you can do that tcmalloc can't:

  • Try to enable locality of reference by using finer-grained allocation pools. For example, if a few thousand objects will be accessed together, then it is best if they are close together in memory. (To minimize cache misses and TLB faults.) If you allocate these instances from their own threadcache, then they should have fairly good locality.

  • If you know in advance which instances will be long-lived and which will not, then allocate them from separate thread caches. If you do not know, then periodically copy the old instances using a threadcache for allocation and update old references to the new instances.

0124816
+2  A: 

To minimize construct/destruct latency, you need fully constructed objects at hand, so you eliminate the new/ctor/dtor/delete time. These "free" objects can be kept in a list so you just push/pop an element at the end.

You can lock the object pools (one for each type) individually. That is a bit more efficient than a system-wide lock, but does not have the overhead of per-object locking.
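A minimal sketch of such a per-type pool (the name `TypedPool` is illustrative): one mutex per pool, free objects kept in a vector and popped/pushed at the back, with plain `new` as the fallback when the pool runs dry.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// One pool (and one lock) per object type, holding fully constructed
// free instances ready to be handed out.
template <typename T>
class TypedPool {
public:
    T* Acquire() {
        {
            std::lock_guard<std::mutex> g(mu_);
            if (!free_.empty()) {
                T* obj = free_.back();   // reuse the most recently freed one
                free_.pop_back();
                return obj;
            }
        }
        return new T();  // pool empty: pay the construction cost this once
    }
    void Release(T* obj) {
        std::lock_guard<std::mutex> g(mu_);
        free_.push_back(obj);
    }
    ~TypedPool() {
        for (T* obj : free_) delete obj;
    }
private:
    std::mutex mu_;
    std::vector<T*> free_;
};
```

Handing back the most recently released instance first also tends to help cache warmth, since that memory was touched last.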

Denes Tarjan
If the ctor and dtor do not allocate or free memory, then they should be fast, and thus there should be no reason to avoid them. Also, the issue with having one freelist per type is that you must use atomic ops to manipulate it, and it will encourage false sharing.
0124816
Not sure that this is the best answer. IMHO the one about tcmalloc is better. Cannot vote, though.
+3  A: 

I assume you have profiled and measured your code after doing all that creation, and verified that create/destroy is actually causing an issue. Otherwise, that is what you should do first.

If you still want to do object pooling, as a first step you should ensure your objects are stateless, because that is a prerequisite for reusing an object. Similarly, you should ensure that the object and its members have no issues with being used from a thread other than the one that created them (COM STA objects, window handles, etc.).

If you use Windows and COM, one way to use system-provided pooling would be to write free-threaded objects and enable object pooling, which makes the COM+ runtime (earlier known as MTS) do this for you. If you use some other platform like Java, perhaps you could use application servers that define interfaces your objects should implement, and the application server could do the pooling for you.

Or you could roll your own code. But you should first try to find out whether there is an existing pattern for this and, if so, use that instead of what follows below.

If you need to roll your own code, create a dynamically growable collection which tracks the objects already created. Preferably use a vector for the collection, since you would only be adding to it and it would be easy to traverse when searching for a free object (assuming you do not delete objects in the pool). Change the collection type according to your delete policies (e.g., a vector of pointers/references to objects in C++, so that you can delete and recreate an object at the same location).

Each object should be tracked using a flag which can be read in a volatile manner and changed using an interlocked function to mark it as used/not used.

If all objects are used, you need to create a new object and add it to the collection. Before adding, you can acquire a lock (critical section), mark the new object as being used and exit the lock.
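Under those assumptions the scheme might look like the following sketch, using C++11 `std::atomic` in place of platform interlocked functions (the names `FlaggedPool`/`Slot` and the fixed capacity are illustrative; the growable vector from the text is simplified to a preallocated array so that scanning threads never race with reallocation):

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>

// Pool of default-constructed objects, each guarded by an atomic
// "in use" flag. Acquire scans lock-free and claims a slot with a
// compare-and-swap; the mutex is taken only to publish a new slot.
template <typename T, std::size_t Capacity = 256>
class FlaggedPool {
    struct Slot {
        std::atomic<bool> in_use{false};
        T object{};
    };
public:
    T* Acquire() {
        std::size_t n = count_.load(std::memory_order_acquire);
        for (std::size_t i = 0; i < n; ++i) {
            bool expected = false;  // claim the first free slot via CAS
            if (slots_[i].in_use.compare_exchange_strong(expected, true))
                return &slots_[i].object;
        }
        // Every published slot was busy: lock, mark, publish one more.
        std::lock_guard<std::mutex> g(mu_);
        std::size_t i = count_.load(std::memory_order_relaxed);
        if (i >= Capacity) return nullptr;  // sketch: no growth past Capacity
        slots_[i].in_use.store(true, std::memory_order_relaxed);
        count_.store(i + 1, std::memory_order_release);
        return &slots_[i].object;
    }
    void Release(T* obj) {
        // O(n) lookup keeps the sketch simple and portable.
        std::size_t n = count_.load(std::memory_order_acquire);
        for (std::size_t i = 0; i < n; ++i) {
            if (&slots_[i].object == obj) {
                slots_[i].in_use.store(false, std::memory_order_release);
                return;
            }
        }
    }
private:
    std::array<Slot, Capacity> slots_{};
    std::atomic<std::size_t> count_{0};
    std::mutex mu_;
};
```

If `Acquire` returns null because the fixed capacity is exhausted, the caller would fall back to a plain heap allocation; a real implementation would grow by chaining additional fixed-size blocks instead.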

Measure and proceed: if you implement the above collection as a class, you can easily create different collections for different object types so as to reduce lock contention between threads that do different work.

Finally, you could implement an overloaded class-factory interface that can create all kinds of pooled objects and knows which collection holds which class.

You could then optimize on this design from there.

Hope that helps.

A: 

If you have some guess of the preferred size of the pool, you can create a fixed-size pool using a stack structure built on an array (the fastest possible solution). Then you need to implement four phases of object lifetime: hard initialization (and memory allocation), soft initialization, soft cleanup, and hard cleanup (and memory release). Now in pseudo code:

Object* ObjectPool::AcquireObject()
{
    Object* object = 0;
    lock( _stackLock );
    if( _stackIndex )
        object = _stack[ --_stackIndex ];
    unlock( _stackLock );
    if( !object )
        object = HardInit();
    SoftInit( object );
    return object;
}

void ObjectPool::ReleaseObject(Object* object)
{
    SoftCleanup( object );
    lock( _stackLock );
    if( _stackIndex < _maxSize )
    {
        _stack[ _stackIndex++ ] = object;
        unlock( _stackLock );
    }
    else
    {
        unlock( _stackLock );
        HardCleanup( object );
    }
}

The HardInit/HardCleanup methods perform full object initialization and destruction; they are executed only if the pool is empty or if the freed object cannot fit in the pool because it is full. SoftInit performs soft initialization of the object; it initializes only those aspects of the object that can have changed since it was released. The SoftCleanup method frees resources used by the object that should be released as fast as possible, or resources that can become invalid while their owner resides in the pool. As you can see, locking is minimal: only two lines of code (or only a few instructions) are under the lock.

These four methods can be implemented in separate (template) classes so you can implement fine tuned operations per object type or usage. Also you may consider using smart pointers to automatically return object to its pool when it is no longer needed.
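The smart-pointer idea can be sketched with a `std::unique_ptr` whose custom deleter returns the object to the pool instead of deleting it. The pool below is a deliberately trivial stand-in for the answer's `ObjectPool`, and `PoolDeleter`/`AcquireScoped` are illustrative names:

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Object { int value = 0; };

class ObjectPool {
public:
    // Trivial stand-ins for the AcquireObject/ReleaseObject above.
    Object* AcquireObject() {
        if (free_.empty()) return new Object();
        Object* o = free_.back();
        free_.pop_back();
        return o;
    }
    void ReleaseObject(Object* o) { free_.push_back(o); }

    // Deleter that recycles the object instead of destroying it.
    struct PoolDeleter {
        ObjectPool* pool;
        void operator()(Object* o) const { pool->ReleaseObject(o); }
    };
    using PooledPtr = std::unique_ptr<Object, PoolDeleter>;

    // Hand out an object that returns itself to the pool automatically
    // when the smart pointer goes out of scope.
    PooledPtr AcquireScoped() {
        return PooledPtr(AcquireObject(), PoolDeleter{this});
    }

    ~ObjectPool() {
        for (Object* o : free_) delete o;
    }
private:
    std::vector<Object*> free_;
};
```

One caveat: the pool must outlive every `PooledPtr` it hands out, since the deleter holds a raw pointer back to the pool.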

Mladen Jankovic
Issues with this are that you will see a lot of contention for the stack lock. Also, false sharing will exist, as allocations from separate threads will be adjacent. In addition, if 3000 temporaries are allocated and then 1 permanent object is allocated, the 3000 temporaries will not be reclaimed.
0124816
I don't get the last part about not reclaiming temporary objects. Can you be more specific?
Mladen Jankovic
I misread the code and misunderstood the answer; the 'no reclaim temporary' part was entirely wrong. :-) However, the above could be improved to use a linked list, thus removing the lock. (Instead use atomic list ops.)
0124816
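The lock-free variant suggested in the comment could look like a Treiber stack of free nodes, built on atomic compare-and-swap. Note that this sketch deliberately ignores the ABA problem, which a production version must address (tagged pointers, hazard pointers, etc.); the names are illustrative.

```cpp
#include <atomic>
#include <cassert>

// A free-list entry; in a pool this header would live inside the
// recycled object's memory.
struct FreeNode { FreeNode* next = nullptr; };

// Lock-free LIFO freelist (Treiber stack). Push and Pop replace the
// lock/unlock pairs of the array-based pool with CAS retry loops.
class LockFreeStack {
public:
    void Push(FreeNode* n) {
        n->next = head_.load(std::memory_order_relaxed);
        while (!head_.compare_exchange_weak(n->next, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
            // On failure n->next is reloaded with the current head; retry.
        }
    }
    FreeNode* Pop() {
        FreeNode* n = head_.load(std::memory_order_acquire);
        while (n != nullptr &&
               !head_.compare_exchange_weak(n, n->next,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            // On failure n is reloaded with the current head; retry.
        }
        return n;  // nullptr means the freelist is empty
    }
private:
    std::atomic<FreeNode*> head_{nullptr};
};
```

Reading `n->next` in `Pop` while another thread may have popped and recycled `n` is exactly where the ABA hazard lives; that is why real implementations add generation counters or deferred reclamation.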
A: 

Have you tried the hoard allocator? It provides better performance than the default allocator on many systems.

Anthony Williams
Thank you. It was interesting reading. However, our guys told me that it has heavy lock contention compared to umem on Solaris.
A: 

Why do you have multiple threads destroying objects they did not create? It's a simple way to handle object lifetime, but the costs can vary widely depending on use.

Anyway, if you haven't started implementing this yet, at the very least you can put the create/destroy functionality behind an interface so that you can test/change/optimize it at a later date, when you have more information about what your system actually does.

MSN

The concept is slightly different. An object can travel through the threads handling it, and when its lifetime comes to an end it may be in a thread other than its mother thread. Of course, the after-life management details are hidden.