views:

117

answers:

6

Hi all.

I'm writing a multithreaded Java application that runs on a Nehalem processor. The problem is that from 4 threads onward I see almost no speedup in my application.

I've made a simple test. I created a thread that just allocates a big array and accesses random entries in it. Since each thread works independently, the running time shouldn't change with the number of threads (assuming I'm not exceeding the number of available CPU cores). But what I observed is that running 1 or 2 threads takes almost the same time, while running 4 or 8 threads is significantly slower. So before trying to solve the algorithmic and synchronization problems in my application, I want to find out the maximal parallelization I can achieve.

I've used the -XX:+UseNUMA JVM option, so the arrays ought to be allocated in memory close to the corresponding threads.

P.S. If the threads were doing a simple mathematical calculation there was no time drop even for 4 or 8 threads, so I concluded that the problems appear when the threads access memory.

Any help or ideas are appreciated, thanks.


EDIT

Thank you all for the replies. I see that I haven't explained myself well enough.

Before trying to eliminate synchronization problems in my application I made a simple test that checks the best possible parallelization that could be achieved. The code is as follows:

public class TestMultiThreadingArrayAccess {
    private final static int arrSize = 40000000;

    private class SimpleLoop extends Thread {
        public void run() {
            int array[] = new int[arrSize];
            for (long i = 0; i < arrSize * 10; i++) {
                array[(int) ((i * i) % arrSize)]++; // randomize a bit the access to the array
            }
            long sum = 0; // note: sum is never used afterwards, so the JIT may optimize these loops away
            for (int i = 0; i < arrSize; i++)
                sum += array[i];
        }
    }

    public static void main(String[] args) {
        TestMultiThreadingArrayAccess test = new TestMultiThreadingArrayAccess();
        for (int threadsNumber : new int[] { 1, 2, 4, 8 }) {
            Statistics timer = new Statistics("Executing " + threadsNumber+ " threads"); // Statistics is a simple helper class that measures the times
            timer.start();
            test.doTest(threadsNumber);
            timer.stop();
            System.out.println(timer.toString());
        }
    }

    public void doTest(int threadsNumber) {
        Thread threads[] = new Thread[threadsNumber];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new SimpleLoop();
            threads[i].start();
        }

        for (int i = 0; i < threads.length; i++)
            try {
                threads[i].join();
            } catch (InterruptedException e) {
            }
    }
}

So as you can see, there is no synchronization at all in this mini-test, and the array is allocated inside the thread, so it should be placed in a chunk of memory that the thread can access quickly. There is also no memory contention in this code. Still, with 4 threads there is a 30% drop in running time, and 8 threads run twice as slowly. As you can see from the code, I just wait until all threads finish their work, and since their work is independent, the number of threads shouldn't affect the total execution time.

The machine has 2 quad-core hyper-threaded Nehalem processors installed (16 logical CPUs in total), so with 8 threads each one could get its own CPU exclusively.

When I tried to run this test with a smaller array (20K entries) the drop in execution time was 7% for 4 threads and 14% for 8 threads, which is satisfying. But when I run random accesses on the large array (40M entries), running times increase dramatically, so I think the problem is that big chunks of memory (because they don't fit in the cache?) are accessed in an inefficient way.
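For reference, here is a small single-threaded sketch (class and method names are mine) that compares sequential access with the same (i * i) % arrSize pattern on both a cache-resident and a large array; actual timings are machine-dependent, but a much bigger gap on the large array would point at cache misses rather than thread contention:

```java
public class AccessPatternDemo {
    public static long touch(int[] array, boolean strided) {
        long sum = 0;
        int n = array.length;
        if (strided) {
            // Same (i * i) % n pattern as in the question: jumps around
            // the array, so each access likely misses cache when the
            // array is much larger than the last-level cache.
            for (long i = 0; i < n; i++) sum += array[(int) ((i * i) % n)];
        } else {
            // Sequential scan: the hardware prefetcher keeps this fast
            // regardless of array size.
            for (int i = 0; i < n; i++) sum += array[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int size : new int[] { 20_000, 40_000_000 }) {
            int[] array = new int[size];
            for (boolean strided : new boolean[] { false, true }) {
                long start = System.nanoTime();
                long sum = touch(array, strided);
                long ns = System.nanoTime() - start;
                System.out.println(size + " entries, "
                        + (strided ? "strided" : "sequential")
                        + ": " + ns + " ns (sum=" + sum + ")");
            }
        }
    }
}
```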

Are there any ideas how to fix this?

Hope this clarifies the question in a better way, thanks again.

+1  A: 

Without knowing exactly what you are doing and what problem you're trying to solve, it looks like you have heavy synchronization around your code, which could be the main reason it doesn't scale. Over-synchronization kills any speedup, since it makes your application almost serial. So my suggestion is to inspect your implementation and try to figure this out.

ADD.

After you've added your implementation of what you are doing: the performance degradation can be explained by large-scale memory access. When all your threads need the memory controller for data that isn't cached, and they are running on different CPUs, the memory controller prevents the CPUs from accessing memory simultaneously, meaning there is synchronization at the hardware level on each cache miss. In your case it's almost as if you were running 10 different independent programs. I guess if you launch 10 (you can replace 10 by any large number) copies of your web browser, for instance, you will see the same effect; that doesn't mean the browser implementation is inefficient, you have just put a huge burden on the computer's memory.

Artem Barger
I've added a snippet of the implementation. There you can see that there is no synchronization at all.
jutky
Added an extended answer.
Artem Barger
A: 

As Artem notes, it is possible that you have unnecessary synchronization. But I'd start by establishing the facts. Is your app REALLY running slower as you describe?

Here is an insightful article on the subject: http://codeidol.com/java/java-concurrency/Testing-Concurrent-Programs/Avoiding-Performance-Testing-Pitfalls/

It's actually quite tough to write useful micro-benchmarks, especially when you are dealing with concurrent code. For example, you can run into "dead code elimination", in which the compiler optimizes away code you think is being executed. It's also difficult to guess when garbage collection runs. HotSpot's runtime optimization also makes measurement more difficult. In the case of threads, you need to take into account the time used to create them, so you might need a `CyclicBarrier` etc. to get an accurate measurement. Things like that.
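As a minimal sketch of the dead-code-elimination point (class and field names are mine, not from any particular benchmark harness): writing the benchmark's result to a volatile sink keeps HotSpot from proving the result unused and deleting the timed loop altogether.

```java
public class DceSinkDemo {
    // Volatile sink: the write cannot be optimized away, so the work
    // that produces the value cannot be eliminated either.
    public static volatile long sink;

    public static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i * (long) i; // some non-trivial work to time
        }
        return sum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        sink = work(1_000_000); // result is consumed, so the loop survives
        long elapsed = System.nanoTime() - start;
        System.out.println("work took " + elapsed + " ns, sink=" + sink);
    }
}
```

Without the `sink = ...` assignment (or some other use of `sum`), the JIT is free to report a near-zero time for the loop.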

Having said that, I find it hard to believe that you'd have issues accessing memory if all you are doing is reading. We might be able to help you better if you can post the code...

Enno Shioji
Thanks, I'll read the article. The code is added.
jutky
A: 

There are two obvious potential problems that spring to mind.

  • Using more threads allocates more arrays, which overflow the cache. Accesses to main memory or to lower levels of cache are much slower.
  • If the threads share the same random number generator instance, they will fight over access to it. It might not be full synchronisation, but rather memory barriers in a lock-free algorithm; lock-free algorithms, although generally fast, get much slower under high contention.
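If the shared-generator theory applies (the posted snippet uses (i * i) % arrSize rather than a `Random`, so this is only relevant if the real application shares one), the usual fix is `java.util.concurrent.ThreadLocalRandom` (Java 7+), which keeps generator state per thread. A sketch with names of my own choosing:

```java
import java.util.concurrent.ThreadLocalRandom;

public class RngPerThreadDemo {
    // Performs 'hits' random increments on a private array using the
    // calling thread's own generator; no shared RNG state is touched,
    // so there is nothing for threads to contend on.
    public static long countHits(int arrSize, int hits) {
        int[] array = new int[arrSize];
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < hits; i++) {
            array[rnd.nextInt(arrSize)]++;
        }
        long sum = 0;
        for (int v : array) sum += v;
        return sum; // equals 'hits': every increment landed somewhere
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(
                () -> System.out.println(countHits(1_000_000, 1_000_000)));
            threads[t].start();
        }
        for (Thread th : threads) th.join();
    }
}
```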
Tom Hawtin - tackline
Isn't the high-level cache separate for each core of the CPU? If that's the case there shouldn't be a problem with the number of threads, because each thread would place its array in the high-level cache of its own CPU. To generate the index I just use ((i * i) % array_size), so that is not a bottleneck.
jutky
@jutky Of course it depends upon the architecture. Certainly it is common to have multiple hardware threads sharing the same highest level cache.
Tom Hawtin - tackline
As I see here: http://en.wikipedia.org/wiki/Nehalem_(microarchitecture) the Nehalem processor has a high-level cache per core. However, it is small, which could explain why I don't see a running-time slowdown for small arrays. Thanks.
jutky
A: 

Aside from concurrency problems, the most likely cause of your slowdown is memory cache contention.

If all threads are accessing the same piece of storage, the chances are it's in another processor's memory cache when you want to access it.

If the storage is "read only", you could give each thread its own copy, which would allow the JVM and processor to optimise the memory accesses.
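A minimal sketch of the per-thread-copy idea (class and method names are mine): each thread clones the shared read-only data before working on it, so its reads hit memory it allocated itself instead of cache lines owned by another socket.

```java
import java.util.Arrays;

public class PerThreadCopyDemo {
    // Works on a private copy of the shared array: with -XX:+UseNUMA
    // the clone is allocated near the thread that made it, and no
    // cache line ping-pongs between sockets during the scan.
    public static long sumLocal(int[] shared) {
        int[] local = shared.clone();
        long sum = 0;
        for (int v : local) sum += v;
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] shared = new int[1_000_000];
        Arrays.fill(shared, 1);
        Thread[] threads = new Thread[4];
        long[] results = new long[threads.length];
        for (int t = 0; t < threads.length; t++) {
            final int id = t;
            threads[t] = new Thread(() -> results[id] = sumLocal(shared));
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        System.out.println(Arrays.toString(results));
    }
}
```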

James Anderson
I've added a code snippet of the test I run. As you can see, each thread accesses its own array, so there shouldn't be a memory contention problem.
jutky
A: 

I modified your test with the advice from the article I posted. On my 2-core machine (that's all I have right now) the results seem reasonable (note that I ran 2 tests for each thread count):

Maybe you can try this? (Please note that I had to modify your test slightly (see comment) because it took very long to run on my poor hardware.)

Also note that I ran this test with the -server option.

Test with threadNum 1 took 2095717473 ns
Test with threadNum 1 took 2121744523 ns
Test with threadNum 2 took 2489853040 ns
Test with threadNum 2 took 2465152974 ns
Test with threadNum 4 took 5044335803 ns
Test with threadNum 4 took 5041235688 ns
Test with threadNum 8 took 10279012556 ns
Test with threadNum 8 took 10347970483 ns

code:

import java.util.concurrent.*;

public class Test{
    private final static int arrSize = 20000000;

    public static void main(String[] args) throws Exception {
        int[] nums = {1,1,2,2,4,4,8,8};//allow hotspot optimization
        for (int threadNum : nums) {
            final CyclicBarrier gate = new CyclicBarrier(threadNum+1);
            final CountDownLatch latch = new CountDownLatch(threadNum);
            ExecutorService exec = Executors.newFixedThreadPool(threadNum);
            for(int i=0; i<threadNum; i++){
                Runnable test = new Runnable(){
                    public void run() {
                        try{
                            gate.await();
                        }catch(Exception e){
                            throw new RuntimeException(e);
                        }
                        int array[] = new int[arrSize];
                        //arrSize * 10 took very long to run so made it
                        // just arrSize.
                        for (long i = 0; i < arrSize; i++) {
                            array[(int) ((i * i) % arrSize)]++;
                        }//for
                        long sum = 0;
                        for (int i = 0; i < arrSize; i++){
                            sum += array[i];
                        }
                        if(new Object().hashCode()==sum){
                            System.out.println("oh");
                        }//if
                        latch.countDown();
                    }//run
                };//test
                exec.execute(test);
            }//for
            gate.await();
            long start = System.nanoTime();
            latch.await();
            long finish = System.nanoTime();
            System.out.println("Test with threadNum " +
                threadNum + " took " + (finish-start) + " ns");
            exec.shutdown();
            exec.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
        }//for
    }//main

}//Test
Enno Shioji
A: 

The bottleneck in the test is CPU-to-memory bandwidth. Even when local memory is available, it is shared by some number of threads. (The memory is local to a node, not to a specific core.) One CPU can easily exceed the available bandwidth with a simple loop like your test above, so adding threads to such a test will not improve performance, and can worsen it due to increased cache-coherence traffic.

Just as a sanity check: are you also using the parallel collector? -XX:+UseParallelGC. UseNUMA only takes effect with it.

mdma
No, I use `-XX:+UseConcMarkSweepGC` , I'll try parallel GC, thanks for the advice.
jutky
Thanks a lot. This really improved the running times greatly. Now for 4 threads it takes 10% more time, and for 8 threads it takes 25% more time.
jutky