It is safe to read an STL container from multiple parallel threads. However, the performance is terrible. Why?

I create a small object that stores some data in a multiset. This makes the constructors fairly expensive (about 5 usecs on my machine). I store hundreds of thousands of these small objects in a large multiset. Processing the objects is an independent business, so I split the work between threads running on a multi-core machine. Each thread reads the objects it needs from the large multiset and processes them.

The problem is that the reading from the big multiset does not proceed in parallel. It looks like the reads in one thread block the reads in the other.

The code below is the simplest I can make it and still show the problem. First it creates a large multiset containing 100,000 small objects each containing its own empty multiset. Then it calls the multiset copy constructor twice in series, then twice again in parallel.

A profiling tool shows that the serial copy constructors take about 0.23 secs, whereas the parallel ones take twice as long. Somehow the parallel copies are interfering with each other.

// a trivial class with a significant ctor and ability to populate an associative container
class cTest
{
    multiset<int> mine;
    int id;
public:
    cTest( int i ) : id( i ) {}
    bool operator<( const cTest& o ) const { return id < o.id; }
};
// add 100,000 objects to multiset
void Populate( multiset<cTest>& m )
{
    for( int k = 0; k < 100000; k++ )
    {
        m.insert( cTest( k ) );
    }
}
// copy construct multiset, called from mainline
void Copy( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_main");
    multiset<cTest> copy( m );
}
// copy construct multiset, called from thread
void Copy2( const multiset<cTest>& m )
{
    cRavenProfile profile("copy_thread");
    multiset<cTest> copy( m );
}
int _tmain(int argc, _TCHAR* argv[])
{
    cRavenProfile profile("test");
    profile.Start();

    multiset<cTest> master;

    Populate( master );

    // two calls to copy ctor from mainline
    Copy( master );
    Copy( master );

    // call copy ctor in parallel
    boost::thread* pt1 = new boost::thread( boost::bind( Copy2, master ));
    boost::thread* pt2 = new boost::thread( boost::bind( Copy2, master ));

    pt1->join();
    pt2->join();

    // display profiler results
    cRavenProfile print_profile;

    return 0;
}

Here is the output

            Scope   Calls       Mean (secs)     Total
      copy_thread        2      0.472498        0.944997
        copy_main        2      0.233529        0.467058
A: 

What's the scheduling of your threads? If you run two threads doing this considerable work, they most likely start together and end together. Hence the profiler thinks the execution of each thread has taken twice as much time, because while each thread executes, the work is being done twice. Each of the sequential calls, by contrast, took the normal time.

        step 0 1 2 3 4 5 6 7 8 9
threaded:    1,2,1,2,1,2,1,2,1,2
sequential:  1,1,1,1,1,2,2,2,2,2

Thread one started at 0 and ended at 8, showing an execution time of 8; thread 2 started at 1 and ended at 9, also an execution time of 8. The two sequential runs show 5 steps each. So in the resulting table you'll see 16 for the concurrent version and 10 for the sequential one.

Assuming all of the above is true, and given a considerable number of steps, the ratio of the execution times shown by your profiler should be about two. The experiment does not contradict this hypothesis.

Pavel Shved
This would be true on a single core machine. On a multiple core machine, the threads run in parallel and, theoretically, take the same amount of wall clock time.
ravenspoint
@ravenspoint: You assume the threads will get assigned to different cores. Maybe. Also, is the processor hyper-threaded?
Martin York
I bet you're not running on multiple cores. Not only should your machine be multi-core, but your system should support it (and map different threads of a single process to different cores). Check in your system monitor whether the threaded program indeed uses several cores. Recompile your kernel with the proper options if it doesn't.
Pavel Shved
I am running on multiple cores. The Windows Task Manager performance tab shows both cores running. For most of my code, both cores run at 100%. During the copy, one is at 100%, the other at about 25% to 50%.
ravenspoint
Where in boost::thread does it say it will leverage multiple processors? Threads commonly share the same memory space within a single process.
DanM
Process != processor. All current operating systems will assign multiple cores, and even multiple processors, to a single process if they would otherwise be idle. The last mainstream OS not to do so was Windows ME.
MSalters
A: 

Since I am not sure how your profiler works, it is hard to tell.
What I would prefer to see is some explicit timing around the code.
Then do the work a number of times to average out anything that causes a context switch:

double tt = 0;
for( int loop = 0; loop < 100; ++loop )
{
    double ts = timer();
    Copy( master );
    Copy( master );
    double te = timer();
    tt += te - ts;
}
tt /= 100;

Compare this with your profiler results.

Martin York
The profiler is not the issue. The effect is so large that I can see it by timing the program with my wristwatch. I just added the profiler to make what is happening more obvious, and to get some output to post here.
ravenspoint
+8  A: 

You mentioned copy constructors. I assume that these also allocate memory from the heap?

Allocating heap memory in multiple threads is a big mistake.

The standard allocator is probably a single-pool, locked implementation. You need either to avoid heap memory (allocate on the stack) or to use a thread-optimized heap allocator.

Zan Lynx
This sounds interesting!
ravenspoint
I believe this is the answer. Stack allocation seems like a bad idea - I'd likely get stack overflows. Can you point me to a thread-optimized heap allocator that will slide easily into STL?
ravenspoint
This is a good theory (regardless of whether he has two physical cores or just hyperthreading, this issue would cause a slowdown). Hoard is a popular multi-threaded allocator I'd recommend.
wrang-wrang
http://stackoverflow.com/questions/147298/multithreaded-memory-allocators-for-c-c
ravenspoint
I tried using Hoard (http://www.hoard.org/). This gave a dramatic doubling of performance. However, the threaded code still ran twice as slowly as the serial code.
ravenspoint
This answers my question: "Why?" Now I need an answer to "How to fix?", but let's leave that until after the holiday weekend.
ravenspoint
Note that STL containers all take an allocator argument. That means you can use an allocator which (A) effectively supports multiple threads and (B) is optimized to allocate objects of the same size. Such allocators can be far more efficient because they can be bitmap-based. You might also be able to optimize your allocator using special-case logic (e.g. only 8 threads)
MSalters
A: 

To answer Pavel Shved with more detail, here is how the majority of my code runs:

        step 0 1 2 3 4 5 6 7 8 9
core1:       1,1,1,1,1
core2:       2,2,2,2,2
sequential:  1,1,1,1,1,2,2,2,2,2

Only the parallel reads interfere with each other.

As an experiment, I replaced the big multiset with an array of pointers to cTest. The code now has huge memory leaks, but never mind. The interesting thing is that the relative performance is worse - running the copy constructors in parallel slows them down by a factor of four!

            Scope   Calls       Mean (secs)     Total
copy_array_thread        2      0.454432        0.908864
  copy_array_main        2      0.116905        0.233811
ravenspoint
Try allocating and destroying your profiler objects manually (via new/delete). Maybe they interfere badly with the `join()`s?
Pavel Shved
A: 

OK, after spending the majority of the week on this issue, I have the fix.

There were two problems with the code I posted in the question:

  1. boost::bind makes a copy of its parameters, even if the underlying function takes them by reference. Copying the container is expensive, so the multi-threaded version was doing extra work. ( No-one noticed this! ) To pass the container by reference, I needed to use this code:

    boost::thread* pt1 = new boost::thread( boost::bind( Copy2, boost::cref(master) ));

  2. As Zan Lynx pointed out, the default allocator obtains memory for the container's contents from the global heap through a thread-safe singleton allocator, resulting in great contention between the threads as they created hundreds of thousands of objects through the same allocator instance. ( Since this was the crux of the mystery, I accepted Zan Lynx's answer. )

The fix for #1 is straightforward, as presented above.

The fix for #2 is, as several people pointed out, to replace the default STL allocator with a thread-specific one. This is quite a challenge, and no-one offered a specific source for such an allocator.

I spent some time looking for a thread-specific allocator "off the shelf". The best I found was Hoard ( hoard.org ). This gave a significant performance improvement; however, Hoard has some serious drawbacks:

  • I experienced some crashes during testing
  • Commercial licensing is expensive
  • It 'hooks' system calls to malloc, a technique I consider dodgy.

So I decided to roll my own thread-specific memory allocator, based on boost::pool and boost::thread_specific_ptr. This required a small amount of, IMHO, seriously advanced C++ code, but it now seems to be working well.

ravenspoint