views:

219

answers:

6
+3  A: 

What's the rest of your environment like? Is this repeatable?

At least on a UNIX box, a long running single process like that is likely to get nice-ed down in priority; if you have 10 threads, each one gets its own slice of the CPU, and so won't accumulate as much CPU time. It will then not lose priority to nice-ification. Overall, it gets a bigger total chunk of CPU.

Added

Just for completeness, this is what your code gives on a dual core mac mini under OS/X 10.5.6

527 $ java MultithreadedCompute
Creating workers
Starting workers
Computing sum 100000000 + ... + (200000000 - 1)
Computing sum 0 + ... + (100000000 - 1)
Computing sum 400000000 + ... + (500000000 - 1)
Computing sum 200000000 + ... + (300000000 - 1)
Computing sum 500000000 + ... + (600000000 - 1)
Computing sum 600000000 + ... + (700000000 - 1)
Computing sum 700000000 + ... + (800000000 - 1)
Computing sum 800000000 + ... + (900000000 - 1)
Computing sum 900000000 + ... + (1000000000 - 1)
Computing sum 300000000 + ... + (400000000 - 1)
Joined with thread 0
Joined with thread 1
Joined with thread 2
Joined with thread 3
Joined with thread 4
Joined with thread 5
Joined with thread 6
Joined with thread 7
Joined with thread 8
Joined with thread 9
Summing worker totals
total=499999999500000000
Elapsed time: 3217 ms
528 $ java SingleThreadedCompute
total=499999999500000000
Elapsed time: 5651 ms
529 $

As you can see, the threads don't necessarily run sequentially, and the run time fof the multithreaded is about 56 percent of the singe thread, indicating it is taking advantage of the threading.

Charlie Martin
+6  A: 

It's possible that this is due to hyper-threading and/or pipelining.

From wikipedia on hyper-threading:

Hyper-threading is an advancement over super-threading. Hyper-threading (officially termed Hyper-Threading Technology or HTT) is an Intel-proprietary technology used to improve parallelization of computations (doing multiple tasks at once) performed on PC microprocessors. A processor with hyper-threading enabled is treated by the operating system as two processors instead of one. This means that only one processor is physically present but the operating system sees two virtual processors, and shares the workload between them.

From wikipedia on piplining:

In computing, a pipeline is a set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion

Software Monkey
I agree with Software Monkey, it sounds like Hyper-threading you should be able to disable/enable this in the BIOS
ninesided
Yes, I remember when I had Pentium IV HT, it shows me 2 cores in operating system.
Adeel Ansari
+3  A: 

I tried turning off the JIT as Pax suggested in the comment above. Pax, if you want to post a quick "Turn off the JIT" answer I'll credit your solution.

Anyway turning off the JIT worked (meaning that it brought the actual results in line with the expected results). I had to back away from one billion as it was taking forever, so I went with 100 million instead. The results are much more in line with what I would expect. Here they are:

FIVE NO-JIT SINGLE-THREADED RUNS

total=4999999950000000 Elapsed time: 17094 ms

total=4999999950000000 Elapsed time: 17109 ms

total=4999999950000000 Elapsed time: 17219 ms

total=4999999950000000 Elapsed time: 17375 ms

total=4999999950000000 Elapsed time: 17125 ms


FIVE NO-JIT MULTITHREADED RUNS

total=4999999950000000 Elapsed time: 18719 ms

total=4999999950000000 Elapsed time: 18750 ms

total=4999999950000000 Elapsed time: 18610 ms

total=4999999950000000 Elapsed time: 18890 ms

total=4999999950000000 Elapsed time: 18719 ms


Thanks guys for the ideas and help.

Willie Wheeler
Yes, but *why* is the machine code different; with JIT on both versions were compiled. There is something important here. And realistically, it's the JIT case that's important in real life.
Software Monkey
It's different because the two programs are themselves different. JIT kicks in when it thinks there's a benefit to doing so. It appears that it's kicking in earlier (for whatever reason) with the MT program.
Willie Wheeler
Alright, I wanted to wait for Pax to post his correct answer so I could credit it, but he didn't so I'm gonna accept this one (which was really his answer). :-D
Willie Wheeler
+1  A: 

A tenth of a second difference? Noise from startup time (alone) will swamp that. Write something that runs for a minute or two.

paulmurray
Well, it's a one second difference (or about 1.5 seconds in the no-JIT case), but I'll give it a try. In the JIT case I would expect to see some convergence between the ST and MT runs just because JIT runtime as a % of overall runtime grows. Lemme give it a try though. One sec.
Willie Wheeler
Hi Paul. I put the sum back up to 1B (instead of 100M) and reran. I just did one sample each for ST and MT but here they are: ST=180281 ms; MT=186516 ms. ST is 6+ seconds faster, which is consistent with the expectation I originally announced.
Willie Wheeler
A: 

An attempt to eliminate variance due to HotSpot between the code executed by single- and multi-threaded variants:

public class ThreadedWorkers {
    private static final long _1B = 1000000000L; // one billion
    private static final long _100M = _1B / 10L;

    enum ThreadMode { SINGLE, SEQUENTIAL, MULTI };

    public static void main(String[] args) throws InterruptedException {
        final long startMs = System.currentTimeMillis();

        ThreadMode mode = args.length == 0 ? ThreadMode.SINGLE : ThreadMode.valueOf(args[0].toUpperCase());

        final long total = computeTotal( mode );

        System.out.println("total=" + total);

        long elapsedMs = System.currentTimeMillis() - startMs;

        System.out.println("Elapsed time: " + elapsedMs + " ms");
    }

    public static long computeTotal (ThreadMode mode) throws InterruptedException {
        Worker[] workers = new Worker[10];

        for (int i = 0; i < 10; i++)
            workers[i] = new Worker(i * _100M, (i+1) * _100M);

        switch (mode) {
            case SINGLE: {
                for (Worker worker : workers )
                    worker.run();

                break;
            } 

            case SEQUENTIAL:{
                for (Worker worker : workers ) {
                    worker.start();
                    worker.join();
                }

                break;
            }

            case MULTI: {
                for (Worker worker : workers )
                    worker.start();

                for (Worker worker : workers )
                    worker.join();

                break;
            }
        }

        System.out.println("Summing worker totals");

        long total = 0;

        for (Worker worker : workers )
            total += worker.getTotal();

        return total;
    }

    static class Worker extends Thread {
        private long start, end, total;

        public Worker(long start, long end) {
            this.start = start;
            this.end = end;
        }

        public void run() {
            System.out.println("Computing sum " + start + " + ... + (" + end + " - 1)");
            for (long i = start; i < end; i++) { total += i; }
        }

        public long getTotal() { return total; }
    }
}

This still runs faster as multi than in sequential or single (by about 10 seconds on an eee pc 900 - 23 vs 13 seconds), even though sequential is executing the same methods as multi the same number of times.

Pete Kirkham
A: 

Just because it is fun... the result from a 8-core server class machine. AMD 2.7GHz Shanghai cpus

Creating workers
Starting workers
Computing sum 0 + ... + (100000000 - 1)
Computing sum 100000000 + ... + (200000000 - 1)
Computing sum 300000000 + ... + (400000000 - 1)
Computing sum 500000000 + ... + (600000000 - 1)
Computing sum 600000000 + ... + (700000000 - 1)
Computing sum 200000000 + ... + (300000000 - 1)
Computing sum 800000000 + ... + (900000000 - 1)
Computing sum 700000000 + ... + (800000000 - 1)
Computing sum 900000000 + ... + (1000000000 - 1)
Computing sum 400000000 + ... + (500000000 - 1)
Joined with thread 0
Joined with thread 1
Joined with thread 2
Joined with thread 3
Joined with thread 4
Joined with thread 5
Joined with thread 6
Joined with thread 7
Joined with thread 8
Joined with thread 9
Summing worker totals
total=499999999500000000
Elapsed time: 444 ms
ReneS