ansaurus

Question

Answer 1

A:

What's the scheduling of your threads? If you run two threads, doing the considerable work, the threads most likely start at once and end at once. Hence the profiler thinks that the execution of each thread has taken twice as much time, because during the time each thread executes the work is done twice. Whereas the execution of each of sequential calls took the normal time.

        step 0 1 2 3 4 5 6 7 8 9
threaded:    1,2,1,2,1,2,1,2,1,2
sequential:  1,1,1,1,1,2,2,2,2,2

Thread one started at 0 and ended at 8, showing execution time as 8; thread 2 started at 1 and ended at 9, the execution time is 8. Two sequential runs show 5 steps each. So in the resultant table you'll see 16 for concurrent version and 10 for the sequential one.

Assuming that all above is true and there is a considerable number of steps, the ratio of the execution times shown by your profiler should be about two. Experiment does not contradict this hypothesis.

Pavel Shved 2009-09-04 19:48:07

This would be true on a single core machine. On a multiple core machine, the threads run in parallel and, theoretically, take the same amount of wall clock time.

ravenspoint 2009-09-04 19:52:12

@ravenspoint. Your assume the threads will get assigned to different cores. Maybee. Also is the processor hyper threaded?

Martin York 2009-09-04 20:09:08

I bet you're not running on multiple cores. Not only your machine should be multi-core, but your system should support it (and it should map different threads of single process to different cores). Check on your system monitor whether threaded program indeed uses several cores. Recompile your kernel with proper options if it doesn't.

Pavel Shved 2009-09-04 20:09:20

I am running on multiple cores. The Windows task manager performance tab shows both cores running.For most of my code, both cores run 100%. During the copy, one is at 100%, the other about 25% to 50%.

ravenspoint 2009-09-04 20:21:23

where in boost::thread does it say it will leverage multiple processors? Threads commonly share the same memory space within a single process.

DanM 2009-09-04 20:23:33

Process != processors. All current operating systems will assign multiple cores and even multiple processors to a single process, if they would otherwise be idle. Last mainstream OS not to do so was Windows ME.

MSalters 2009-09-07 11:14:39

Answer 2

A:

Since I am not sure how your profiler is working it is hard to tell.
What I would prefer to see is some explicit timing around the code:
Then do the work a couple of times to average out any thing that causing a context switch.

for(int loop=0;loop < 100;++loop)
{
   ts = timer();
   Copy( master );    
   Copy( master );
   te = timer();
   tt += te - ts;
}
tt /= 100;

etc Compare this with your profiler results.

Martin York 2009-09-04 20:13:22

The profiler is not the issue. The effect is so large, I can see it by timing the program with my wristwatch. I just added the profiler to make it more obvious what is happening, and get some output to post here.

ravenspoint 2009-09-04 20:24:22

Answer 3

+8 A:

You mentioned copy constructors. I assume that these also allocate memory from the heap?

Allocating heap memory in multiple threads is a big mistake.

The standard allocator is probably a single pool locked implementation. You need to either not use heap memory (stack allocate) or you need a thread optimized heap allocator.

Zan Lynx 2009-09-04 20:23:00

This sounds interesting!

ravenspoint 2009-09-04 20:33:12

I believe this is the answer.Stack allocation seems like a bad idea - I'd likely get stack overflows.Can you point me to a thread optimized heap allocator that will slide easily into STL?

ravenspoint 2009-09-04 20:38:34

This a good theory (regardless of whether he has two physical cores or just hyperthreading, this issue would cause a slowdown). Hoard is a popular multi-threaded allocator I'd recommend.

wrang-wrang 2009-09-04 20:42:32

http://stackoverflow.com/questions/147298/multithreaded-memory-allocators-for-c-c

ravenspoint 2009-09-04 20:45:57

I tried using hoard http://www.hoard.org/ This gave a dramatic doubling of performance. However the threaded code still ran twice as slow as the serial.

ravenspoint 2009-09-04 21:38:56

This answers my question: "Why?"Now I need an answer to "How to fix?", but let's leave that for after the holiday weekend.

ravenspoint 2009-09-04 22:17:28

Note that STL containers all take an allocator argument. That means you can use an allocator which (A) effectively supports multiple threads and (B) is optimized to allocate objects of the same size. Such allocators can be far more efficient because they can be bitmap-based. You might also be able to optimize your allocator using special-case logic (e.g. only 8 threads)

MSalters 2009-09-07 11:19:14

Answer 4

A:

To answer Pavel Shved with more detail, here is how the majority of my code runs:

        step 0 1 2 3 4 5 6 7 8 9
core1:       1 1 1 1 1
core2:       2,2,2,2,2
sequential:  1,1,1,1,1,2,2,2,2,2

Only the parallel reads interfere with each other.

As an experiment, I replaces the big multiset with an array of pointers to cTest. The code now has huge memory leaks, but never mind. The interesting thing is that the relative performance is worse - running the copy constructors in parallel slows them down 4 times!

Scope   Calls       Mean (secs)     Total
    copy_array_thread        2      0.454432        0.908864
      copy_array_main        2      0.116905        0.233811

ravenspoint 2009-09-04 20:32:18

Try allocating and destroying (via new/delete) your profiler objects manually. Maybe it badly interferes with `join()`s?

Pavel Shved 2009-09-04 20:37:09

Answer 5

A:

OK, after spending the majority of the week on this issues, I have the fix.

There were two problems with the code I posted in the question:

boost::bind makes a copy of its parameters, even if the underlying function uses call by reference. Copying the container is expensive, and so the multi-threaded version was working too hard. ( No-one noticed this! ) To pass the container by reference, I needed to use this code:

boost::thread* pt1 = new boost::thread( boost::bind( Copy2, boost::cref(master) ));
As Zan Lynx pointed out the default container allocates memory for its contents on the global heap using a thread safe singleton memory allocator, resulting in great contention between the threads as they created hundreds of thousands of objects through the same allocator instance. ( Since this was the crux of the mystery, I accepted Zan Lynx's answer. )

The fix for #1 is straightforward, as presented above.

The fix for #2 is, as several people pointed out, to replace the default STL allocator with a thread specific one. This is quite the challenge, and no-one offered a specific source for such an allocator.

I spent some time looking for a thread specific allocator "off the shelf". The best I found was hoard ( hoard.org ). This provided a significant performance improvement, however hoard has some serious drawbacks

I experienced some crashes during testing
Commercial licensing is expensive
It 'hooks' system calls to malloc, a technique I consider dodgy.

So I decided to roll my own thread specific memory allocator, based on boost::pool and boost::threadspecificptr. This required a small amount of, IMHO, seriously advanced C++ code, but now seems to be working well.

ravenspoint 2009-09-11 13:32:29

ansaurus

tags:

views:

answers:

Parrallel reads from STL containers

related questions