+2  A: 

This is not a fair test of the thread pool, for the following reasons:

  1. You are not taking advantage of pooling at all, because you only have one thread.
  2. The job is so simple that the pooling overhead can't be justified. A multiplication on a CPU with an FPU only takes a few cycles.

Consider the following extra steps the thread pool has to perform, besides creating the object and running the job:

  1. Put the job in the queue
  2. Remove the job from queue
  3. Get the thread from the pool and execute the job
  4. Return the thread to the pool

When you have a real job and multiple threads, the benefit of the thread pool will be apparent.
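The effect can be sketched with a quick comparison (the job body, job count, and class name here are illustrative, not taken from the question): give the pool as many threads as cores and a job that does real work, and the per-task queueing cost is amortized.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolOverheadDemo {

    // A job heavy enough that the pooling overhead no longer dominates.
    static void heavyJob() {
        double x = 0;
        for ( int i = 0; i < 1000000; i++ ) {
            x += Math.sqrt( i );
        }
        if ( x < 0 ) System.out.println( x ); // defeat dead-code elimination
    }

    public static void main( String[] args ) throws InterruptedException {
        final int jobs = 32;

        // sequential baseline
        long start = System.currentTimeMillis();
        for ( int i = 0; i < jobs; i++ ) {
            heavyJob();
        }
        System.out.println( "sequential: "
                + ( System.currentTimeMillis() - start ) + " ms" );

        // same work, spread over one thread per core
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors() );
        start = System.currentTimeMillis();
        for ( int i = 0; i < jobs; i++ ) {
            pool.submit( new Runnable() {
                public void run() {
                    heavyJob();
                }
            } );
        }
        pool.shutdown();
        pool.awaitTermination( 60, TimeUnit.SECONDS );
        System.out.println( "pooled: "
                + ( System.currentTimeMillis() - start ) + " ms" );
    }
}
```

On a multi-core machine the pooled run should come out noticeably ahead, precisely because each submitted job now dwarfs the queue/hand-off overhead listed above.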

ZZ Coder
I second ZZ Coder; in my experience the benefits will become more apparent when your thread pool is larger.
Everyone
The executor doesn't have to "get" and "return" a thread. It creates internal worker threads that poll() the queue of tasks. Also, given the low time complexity of the task, it is probably an advantage to use only one thread; otherwise, there is a chance of the lock in the BlockingQueue being contended, causing worker threads to move in and out of the Runnable state. The real cost? Going to the kernel to create a thread, and calling a blocking operation while waiting for the thread to terminate. 100,000 isn't a lot. But lesson learned: performance tuning requires testing.
Tim Bender
I did try thread pool sizes between 1 and 8; they all returned about the same numbers. I concentrated on a pool size of 1 because I wanted to measure the overhead of the executor framework. Your comment does reinforce that I need to further study the internals of the framework.
Shahbaz
A: 

Firstly, there are a few issues with the microbenchmark. You do a warm-up, which is good. However, it is better to run the test multiple times, which should give you a feel for whether it has really warmed up, and for the variance of the results. It also tends to be better to test each algorithm in a separate run, otherwise you might cause deoptimisation when an algorithm changes.

The task is very small, although I'm not entirely sure how small, so the "number of times faster" figure is pretty meaningless. In multithreaded situations, every call will touch the same volatile locations, so multiple threads could cause really bad performance (use a Random instance per thread). Also, a run of 47 milliseconds is a bit short.
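The per-thread Random suggestion can be sketched like this (the class and field names are illustrative): Math.random() delegates to a single shared generator whose seed update is a point of cross-thread contention, so giving each thread its own java.util.Random via a ThreadLocal removes that bottleneck.

```java
import java.util.Random;

public class PerThreadRandom {

    // Math.random() funnels all threads through one shared Random;
    // a ThreadLocal gives every worker thread a private generator.
    private static final ThreadLocal<Random> RANDOM = new ThreadLocal<Random>() {
        @Override
        protected Random initialValue() {
            return new Random();
        }
    };

    static double sample() {
        Random r = RANDOM.get(); // no cross-thread contention on the seed
        return r.nextDouble() * r.nextDouble();
    }

    public static void main( String[] args ) {
        System.out.println( sample() );
    }
}
```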

Certainly, handing a tiny operation to another thread is not going to be fast. Split tasks up into bigger sizes if possible. JDK7 looks as if it will have a fork-join framework, which attempts to support fine-grained tasks from divide-and-conquer algorithms by preferring to execute tasks on the same thread, in order, with larger tasks pulled out by idle threads.
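The fork-join idea can be sketched against the proposed java.util.concurrent API (ForkJoinPool / RecursiveTask; since JDK7 is not yet final, details may change, and the class name and threshold here are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Splits a summation until chunks are small enough to run inline;
// idle worker threads steal the un-split halves.
public class SumTask extends RecursiveTask<Long> {

    private static final int THRESHOLD = 10000;

    private final long[] data;
    private final int lo, hi;

    SumTask( long[] data, int lo, int hi ) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if ( hi - lo <= THRESHOLD ) {
            // small enough: run on the current thread, in order
            long sum = 0;
            for ( int i = lo; i < hi; i++ ) {
                sum += data[ i ];
            }
            return sum;
        }
        int mid = ( lo + hi ) >>> 1;
        SumTask left = new SumTask( data, lo, mid );
        SumTask right = new SumTask( data, mid, hi );
        left.fork();                          // make the left half stealable
        return right.compute() + left.join(); // run the right half here
    }

    public static void main( String[] args ) {
        long[] data = new long[ 1000000 ];
        for ( int i = 0; i < data.length; i++ ) {
            data[ i ] = i;
        }
        long sum = new ForkJoinPool().invoke( new SumTask( data, 0, data.length ) );
        System.out.println( sum ); // 499999500000
    }
}
```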

Tom Hawtin - tackline
Good point about running the test several times. I actually did run it many times; I just pasted a single result. I do take your point about improving the benchmark.
Shahbaz
+6  A: 
  1. Using executors is about utilizing CPUs and/or CPU cores, so if you want a thread pool that makes the best use of your hardware, you have to have as many threads as CPUs/cores.
  2. You are right, creating new objects costs too much. One way to reduce that expense is to use batches. If you know the kind and amount of computations to do, you create batches: think of thousand(s of) computations done in one executed task. You create a batch for each thread, and as soon as a batch's computation is done (java.util.concurrent.Future), you create the next one. Even the creation of new batches can be done in parallel (4 CPUs -> 3 threads for computation, 1 thread for batch provisioning). In the end, you may end up with more throughput, but at the cost of higher memory demands (batches, provisioning).

Edit: I changed your example and I let it run on my little dual-core x200 laptop.

provisioned 2 batches to be executed
simpleCompuation:14
computationWithObjCreation:17
computationWithObjCreationAndExecutors:9

As you can see in the source code, I took the batch provisioning and the executor lifecycle out of the measurement, too. That's fairer compared to the other two methods.

See the results for yourself...

import java.util.List;
import java.util.Vector;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecServicePerformance {

    private static int count = 100000;

    public static void main( String[] args ) throws InterruptedException {

        final int cpus = Runtime.getRuntime().availableProcessors();

        final ExecutorService es = Executors.newFixedThreadPool( cpus );

        final Vector< Batch > batches = new Vector< Batch >( cpus );

        final int batchComputations = count / cpus;

        for ( int i = 0; i < cpus; i++ ) {
            batches.add( new Batch( batchComputations ) );
        }

        System.out.println( "provisioned " + cpus + " batches to be executed" );

        // warmup
        simpleCompuation();
        computationWithObjCreation();
        computationWithObjCreationAndExecutors( es, batches );

        long start = System.currentTimeMillis();
        simpleCompuation();
        long stop = System.currentTimeMillis();
        System.out.println( "simpleCompuation:" + ( stop - start ) );

        start = System.currentTimeMillis();
        computationWithObjCreation();
        stop = System.currentTimeMillis();
        System.out.println( "computationWithObjCreation:" + ( stop - start ) );

        // Executor

        start = System.currentTimeMillis();
        computationWithObjCreationAndExecutors( es, batches );    
        es.shutdown();
        es.awaitTermination( 10, TimeUnit.SECONDS );
        // Note: Executor#shutdown() and Executor#awaitTermination() requires
        // some extra time. But the result should still be clear.
        stop = System.currentTimeMillis();
        System.out.println( "computationWithObjCreationAndExecutors:"
                + ( stop - start ) );
    }

    private static void computationWithObjCreation() {

        for ( int i = 0; i < count; i++ ) {
            new Runnable() {

                @Override
                public void run() {

                    double x = Math.random() * Math.random();
                }

            }.run();
        }

    }

    private static void simpleCompuation() {

        for ( int i = 0; i < count; i++ ) {
            double x = Math.random() * Math.random();
        }

    }

    private static void computationWithObjCreationAndExecutors(
            ExecutorService es, List< Batch > batches )
            throws InterruptedException {

        for ( Batch batch : batches ) {
            es.submit( batch );
        }

    }

    private static class Batch implements Runnable {

        private final int computations;

        public Batch( final int computations ) {

            this.computations = computations;
        }

        @Override
        public void run() {

            int countdown = computations;
            // '> 0' runs exactly 'computations' iterations
            while ( countdown-- > 0 ) {
                double x = Math.random() * Math.random();
            }
        }
    }
}
oeogijjowefi
Interesting solution. Gives me some ideas about how to change my use of executors.
Shahbaz
+1, very nice example.
Andrzej Doyle
Hi, if I run this example on a Mac OS X dual core, I get:

simpleComputation: 268
computationWithObjCreation: 155
computation2: 0

because the result of computationWithObjCreationAndExecutors is not retrieved? If I move the es.shutdown() and es.awaitTermination() before we take the stop time, then the result is:

provisioned: 2 batches to be executed
simpleComputation: 261
computationWithObjCreation: 92
computationWithObjCreationAndExecutors: 126

where computationWithObjCreationAndExecutors consistently performs worse than computationWithObjCreation. Why is this happening?
portoalet
If I only modify computationWithObjCreationAndExecutors so I have es.submit(batch).get(), then the time gets reduced for all 3 of them, i.e.:

provisioned: 2 batches to be executed
simpleComputation: 96
computationWithObjCreation: 102
computationWithObjCreationAndExecutors: 96

Am I missing something?
portoalet
You're right. Of course, the stopwatch for 'computationWithObjCreationAndExecutors' needs to be stopped after Executor#awaitTermination() is invoked. I'll update the code. The batches executed in parallel are still faster (on multiple cores). But there is extra time for internal work when Executor#awaitTermination() and Executor#shutdown() are invoked, so the true figure is a little less than displayed. ... Although, in the end, the execution depends on your environment. There may be differences in core/CPU usage (VM, VM options, the scheduler of your OS) on the OSX platform...
oeogijjowefi
...There is a chance that the execution of the batches will perform worse. Maybe the OS blocks the execution of one or more cores during batch execution in favour of tasks with higher priority. The example above is only meant to illustrate the concept of parallel execution and batches for computations on larger datasets. Even the simple calculation problem will not result in an adequate benchmark. A real benchmark would require much more effort and a clean environment without disturbing tasks on your system (like checking instant messenger or mail updates, updating your window, updating your clock, ... ;)
oeogijjowefi
+2  A: 

I don't think this is at all realistic, since you're creating a new executor service every time you make the method call. Unless you have very strange requirements, that seems unrealistic - typically you'd create the service when your app starts up, and then submit jobs to it.

If you try the benchmark again, but initialise the service as a field, once, outside the timing loop, then you'll see the actual overhead of submitting Runnables to the service vs. running them yourself.
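A sketch of that setup (the class and method names are illustrative): create the service once as a field, reuse it for every submission, and only shut it down when the application exits, so each call pays only the hand-off cost.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobRunner {

    // Created once, at class load; every caller reuses the same pool.
    private static final ExecutorService SERVICE = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors() );

    public static void submit( Runnable job ) {
        SERVICE.submit( job ); // only the hand-off is paid per call
    }

    public static void shutdown() throws InterruptedException {
        SERVICE.shutdown();
        SERVICE.awaitTermination( 10, TimeUnit.SECONDS );
    }

    public static void main( String[] args ) throws InterruptedException {
        for ( int i = 0; i < 100000; i++ ) {
            submit( new Runnable() {
                public void run() {
                    double x = Math.random() * Math.random();
                }
            } );
        }
        shutdown();
    }
}
```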

But I don't think you've fully grasped the point - Executors aren't there for efficiency; they're there to make co-ordinating and handing off work to a thread pool simpler. They will always be less efficient than just invoking Runnable.run() yourself (since at the end of the day the executor service still needs to do this, after doing some extra housekeeping beforehand). It's when you are using them from multiple threads needing asynchronous processing that they really shine.

Also consider that you're looking at the relative time difference of a basically fixed cost (Executor overhead is the same whether your tasks take 1ms or 1hr to run) compared to a very small variable amount (your trivial runnable). If the executor service takes 5ms extra to run a 1ms task, that's not a very favourable figure. If it takes 5ms extra to run a 5 second task (e.g. a non-trivial SQL query), that's completely negligible and entirely worth it.

So to some extent it depends on your situation - if you have an extremely time-critical section, running lots of small tasks, that don't need to be executed in parallel or asynchronously then you'll get nothing from an Executor. If you're processing heavier tasks in parallel and want to respond asynchronously (e.g. a webapp) then Executors are great.

Whether they are the best choice for you depends on your situation, but really you need to try the tests with realistic representative data. I don't think it would be appropriate to draw any conclusions from the tests you've done unless your tasks really are that trivial (and you don't want to reuse the executor instance...).

Andrzej Doyle
I initialize the executor inside a method, but not inside the loop. I used methods simply to keep the tests separate. I know that executors have their overhead; I was surprised that it was so high. Unfortunately (or fortunately), most of my computations really are that trivial (simple arithmetic), except they are done on a lot of messages. Think of a messaging system which handles a flood of messages, where the transformation of each message is not overly expensive. What I am getting from this is that I need to make my program concurrent at a different granularity from what I was originally thinking.
Shahbaz