ansaurus

Question

What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

Answer 1

+1 A:

Update: V5 - Busy Wait in all threads (seems optimal so far)

Since all cores are dedicated to this task, it seemed worth a try to simply eliminate all the complex synchronization constructs and do a busy wait at each synchronization point in all threads. This turns out to beat all other approaches by a wide margin.

The setup is as follows: start with V4 above (CyclicBarrier + Busy Wait). Replace the CyclicBarrier with an AtomicInteger that the main thread resets to zero each cycle. Each worker thread Runnable that completes its work increments the atomic integer by one. The main thread busy waits:

// busy-wait until every worker thread has reported completion
while (atomicInt.get() < workerThreadCount) {
    // spin
}

Instead of 8, only 7 worker threads are launched (since all threads, including the main thread, now load a core pretty much completely). The results are as follows:

blocksize | system | user | cycles/sec
256k      |   1.0% |  98% |       1.36
64k       |   1.0% |  98% |       6.8
16k       |   1.0% |  98% |      44.6
4096      |   1.0% |  98% |     354
1024      |   1.0% |  98% |    1189
256       |   1.0% |  98% |    3222
64        |   1.5% |  98% |    8333
16        |   2.0% |  98% |   16129

Using wait/notify in the worker threads instead reduces the throughput to about one third of what this solution achieves.
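For readers who want to try the V5 scheme, here is a minimal sketch of it, not the original code. The class name BusyWaitCycleDemo, the cycle counter used to hand out work, and the placeholder "work" (incrementing a counter) are assumptions made for illustration. Thread.onSpinWait() requires Java 9 or later and was not available when this answer was written; an empty loop body behaves the same way functionally.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class: busy-wait synchronization in all threads (V5 style).
final class BusyWaitCycleDemo {

    /** Runs the given number of cycles and returns how many work items were processed. */
    static long runCycles(int workers, int cycles) {
        AtomicInteger cycle = new AtomicInteger(0);   // bumped by main to start a cycle; -1 = shut down
        AtomicInteger done = new AtomicInteger(0);    // workers finished in the current cycle
        AtomicLong processed = new AtomicLong(0);     // stands in for the real work

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                int seen = 0;
                while (true) {
                    int c;
                    // busy-wait for the main thread to announce a new cycle
                    while ((c = cycle.get()) == seen) {
                        Thread.onSpinWait();
                    }
                    if (c < 0) {
                        return;                       // shutdown signal
                    }
                    seen = c;
                    processed.incrementAndGet();      // placeholder work item
                    done.incrementAndGet();           // report completion
                }
            });
            threads[i].start();
        }

        for (int n = 1; n <= cycles; n++) {
            done.set(0);                              // safe: all workers are spinning on 'cycle'
            cycle.set(n);                             // release the workers
            // busy-wait until every worker has reported completion
            while (done.get() < workers) {
                Thread.onSpinWait();
            }
        }

        cycle.set(-1);
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // preserve the interrupt status
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(runCycles(3, 1000));
    }
}
```

The reset of the completion counter happens before the next cycle is announced, so no worker can increment it while it is being cleared. The sketch only makes sense when there are at least as many free cores as spinning threads.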

Alex Dunlop 2010-04-26 15:00:26

Answer 2

A:

I also wonder whether you could try more than 8 threads. If your CPU supports Hyper-Threading then (at least in theory) you can run 2 threads per core and see what comes out of it.

Andrew 2010-04-26 21:54:37

Answer 3

+1 A:

Update: V6 - Busy Wait, with main thread also working

An obvious improvement on V5 (busy wait for work in the 7 worker threads, busy wait for completion in the main thread) was to split the work into 7+1 parts again. The main thread processes one part concurrently with the worker threads instead of just busy-waiting, and only afterwards busy-waits for the other threads to complete their work items. This puts the 8th processor (in the example's 8-core configuration) to work and adds its cycles to the available compute resources.

This was indeed straightforward to implement, and the results are again slightly better:

blocksize | system | user | cycles/sec
256k      |   1.0% |  98% |       1.39
64k       |   1.0% |  98% |       6.8
16k       |   1.0% |  98% |      50.4
4096      |   1.0% |  98% |     372
1024      |   1.0% |  98% |    1317
256       |   1.0% |  98% |    3546
64        |   1.5% |  98% |    9091
16        |   2.0% |  98% |   16949

So this seems to represent the best solution so far.
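A rough sketch of the V6 idea follows, assuming the work can be cut into equal slices. The class name MainAlsoWorksDemo, the sumSlice helper and the use of array summation as the workload are illustrative assumptions, not the original application. Thread.onSpinWait() assumes Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: main thread takes one share of the work (V6 style).
final class MainAlsoWorksDemo {

    // Sums one of 'parts' equal slices of the data; stands in for the real per-block work.
    static long sumSlice(int[] data, int part, int parts) {
        int from = (int) ((long) data.length * part / parts);
        int to = (int) ((long) data.length * (part + 1) / parts);
        long s = 0;
        for (int i = from; i < to; i++) {
            s += data[i];
        }
        return s;
    }

    /** Repeats the parallel sum 'cycles' times and returns the total of the last cycle. */
    static long runCycles(int[] data, int workers, int cycles) {
        final int parts = workers + 1;                 // N workers + the main thread
        final long[] partial = new long[parts];        // one result slot per part
        AtomicInteger cycle = new AtomicInteger(0);    // -1 = shut down
        AtomicInteger done = new AtomicInteger(0);

        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            final int part = i;
            threads[i] = new Thread(() -> {
                int seen = 0;
                while (true) {
                    int c;
                    while ((c = cycle.get()) == seen) {
                        Thread.onSpinWait();           // busy-wait for work
                    }
                    if (c < 0) {
                        return;
                    }
                    seen = c;
                    partial[part] = sumSlice(data, part, parts);
                    done.incrementAndGet();            // publishes the write to partial[part]
                }
            });
            threads[i].start();
        }

        long total = 0;
        for (int n = 1; n <= cycles; n++) {
            done.set(0);
            cycle.set(n);
            // main thread works on the last part instead of idling
            partial[workers] = sumSlice(data, workers, parts);
            // only then busy-wait for the other threads
            while (done.get() < workers) {
                Thread.onSpinWait();
            }
            total = 0;
            for (long p : partial) {
                total += p;
            }
        }

        cycle.set(-1);
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total;
    }
}
```

The plain long[] is read by the main thread only after it has observed the workers' increments of the atomic counter, which is what makes the partial results visible to it.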

Alex Dunlop 2010-04-27 08:26:21

Answer 4

+1 A:

Update: V7 - Busy Wait that reverts to Wait/Notify

After some experimenting with V6, it turned out that the busy waits obscure the real hotspots of the application somewhat when profiling. In addition, the fan on the system keeps going into overdrive even when no work items are being processed. So a further improvement was to busy wait for work items for a fixed amount of time (say, about 2 milliseconds) and then to fall back to a "nicer" wait()/notify() combination. The worker threads simply publish their current wait mode to the main thread via an atomic boolean that indicates whether they are busy waiting (and hence just need a work item to be set) or whether they expect a call to notify() because they are in wait().
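One way the spin-then-wait hand-off could look is sketched below. This is an illustration under assumptions, not the original implementation: the class SpinThenWaitSlot and its methods are hypothetical, it assumes exactly one producer (the main thread) and one consumer (a worker) per slot, and it assumes the producer only hands over a new item after the previous one has been taken. Thread.onSpinWait() assumes Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical single-producer/single-consumer slot: spin briefly, then wait()/notify().
final class SpinThenWaitSlot<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();
    private final AtomicBoolean parked = new AtomicBoolean(false); // worker's published wait mode
    private final Object lock = new Object();
    private final long spinNanos;

    SpinThenWaitSlot(long spinNanos) {
        this.spinNanos = spinNanos;
    }

    boolean isEmpty() {
        return ref.get() == null;
    }

    /** Main thread: publish a work item; pay for notify() only if the worker is in wait(). */
    void put(T item) {
        ref.set(item);
        if (parked.get()) {
            synchronized (lock) {
                lock.notify();
            }
        }
    }

    /** Worker thread: busy-wait up to spinNanos, then fall back to wait(). */
    T take() {
        final long start = System.nanoTime();
        T item;
        while ((item = ref.getAndSet(null)) == null) {
            if (System.nanoTime() - start < spinNanos) {
                Thread.onSpinWait();
                continue;
            }
            parked.set(true);                      // announce wait mode first ...
            try {
                synchronized (lock) {
                    while (ref.get() == null) {    // ... then re-check before sleeping
                        lock.wait();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for work", e);
            } finally {
                parked.set(false);
            }
        }
        return item;
    }

    /** Small usage example: the second item arrives after the worker has stopped spinning. */
    static int demo() {
        SpinThenWaitSlot<Integer> slot = new SpinThenWaitSlot<>(2_000_000L); // about 2 ms
        int[] sum = new int[1];
        Thread worker = new Thread(() -> sum[0] = slot.take() + slot.take());
        worker.start();
        slot.put(1);
        while (!slot.isEmpty()) {
            Thread.onSpinWait();                   // wait until the first item was taken
        }
        try {
            Thread.sleep(50);                      // long enough for the worker to enter wait()
            slot.put(2);
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum[0];
    }
}
```

The worker sets the flag before re-checking the slot under the lock, and the producer sets the slot before reading the flag. With that ordering, either the worker sees the item and skips wait(), or the producer sees the flag and calls notify(), so a wakeup cannot be lost.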

Another improvement that turned out to be rather straightforward was to let threads that have completed their primary work item repeatedly invoke a client-supplied callback while they wait for the other threads to complete theirs. That way, the wait time (which occurs because threads are bound to get slightly different workloads) is not completely lost to the application.
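The callback idea might be sketched as follows for a single cycle. The class name IdleCallbackDemo, the deliberately uneven placeholder workload and the Runnable callback type are assumptions for illustration; the callback must be thread-safe and short, since it delays the moment a thread notices that everyone has finished.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class: finished threads run a client callback instead of idling.
final class IdleCallbackDemo {

    /** Runs one cycle and returns the number of threads that completed their primary work. */
    static int run(int threadCount, Runnable idleCallback) {
        AtomicInteger done = new AtomicInteger(0);
        long[] sink = new long[threadCount];           // keeps the placeholder work observable
        Thread[] threads = new Thread[threadCount];

        for (int i = 0; i < threadCount; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                // primary work item, intentionally uneven across threads
                long x = 0;
                for (int k = 0; k < (id + 1) * 200_000; k++) {
                    x += k;
                }
                sink[id] = x;
                done.incrementAndGet();
                // use the remaining wait time for the client's secondary work
                while (done.get() < threadCount) {
                    idleCallback.run();
                }
            });
            threads[i].start();
        }

        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }
}
```

A call such as IdleCallbackDemo.run(4, counter::incrementAndGet) with an AtomicLong counter shows how much idle time was recovered; the count varies from run to run and may be zero if all threads finish at nearly the same time.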

I am still very interested in hearing from other users that encountered a similar use case.

Alex Dunlop 2010-04-28 15:31:21


What is the fastest cyclic synchronization in Java (ExecutorService vs. CyclicBarrier vs. X)?

V1 - ExecutorService

V2 - CyclicBarrier

V3 - Wait/Notify + CyclicBarrier

V4 - Busy wait instead of wait/notify

V5 - ?
