ansaurus

Question

Answer 1

+1 A:

Without knowing what exactly you are doing and what is the problem your trying to solve. It looks like you have heavy synchronization around your code, since it could be the main reason for not to be scalable enough. Over synchronization cause to slow down any speedup, once it make your application almost serial. So my suggestion to you is to inspect your implementation and trying to figure this out.

ADD.

After you've added your implementation of what you are doing. The downgrade of performance could be explained by large and massive memory access. Once you running all you thread and they need to access memory controller for not cached data, since they running on different CPU's, memory controller prevents CPU's from doing it simultaneously, meaning there is a synchronization at hardware level on each cache miss. In you case it's almost equal as if you were running 10 different independent programs. I guess if you will launch 10 (you can replace 10 by any large number) copies your web browser, for instance, you will see same effect, but this doesn't mean that browser implementation is ineffective, you just create a huge burden on computer memory.

Artem Barger 2010-07-15 06:37:14

I've added a snippet of the implementation. There you can see that there is no synchronization at all.

jutky 2010-07-15 11:53:27

Added an extended answer.

Artem Barger 2010-07-15 16:42:18

Answer 2

A:

As Artem notes, it is possible that you have unnecessary synchronization. But I'd start by establishing the facts. Is your app REALLY running slower as you describe?

Here is an insightful article on the subject: http://codeidol.com/java/java-concurrency/Testing-Concurrent-Programs/Avoiding-Performance-Testing-Pitfalls/

It's actually quite tough to write useful micro benchmarks, especially when you are dealing with concurrent code. For example, you can have "Dead code elimination" in which the compiler optimizes the code away you think is being executed. It's also difficult to guess when garbage collection runs. Hotspot's runtime optimization makes also measurement more difficult. In case of threads, you need to take in account the time that is used to create them. So you might need to use a ｀CyclicBarrier` etc. to have accurate measurement. Things like that..

Having said that, I find it hard that you'll have issues in accessing memory if all you are doing is reading. We might be able to help you better if you can post the code...

Enno Shioji 2010-07-15 07:11:18

Thanks, I'll read the article. The code is added.

jutky 2010-07-15 11:59:42

Answer 3

A:

There are two obvious potential problems that spring to mind.

Using more thread allocates more arrays which burst the cache. Accesses to main memory or lower levels of cache are much slower.
If you are using the same source of instance of random number generator, then threads will be fighting over access to it. It might not be full synchronisation, but instead memory barriers with a lock-free algorithm. Generally lock-free algorithms, although generally fast, get much slower under high contention.

Tom Hawtin - tackline 2010-07-15 08:18:57

Isn't high level cash is separate for each core of the CPU? If this is the case there shouldn't be a problem with number of threads because each of them would place it's array in the high level cash of it's CPU.To generate random number I just use( (i^2) % array_size) so this is not a bottleneck.

jutky 2010-07-15 11:56:52

@jutky Of course it depends upon the architecture. Certainly it is common to have multiple hardware threads sharing the same highest level cache.

Tom Hawtin - tackline 2010-07-15 12:31:28

as I see here: http://en.wikipedia.org/wiki/Nehalem_(microarchitecture) in Nehalem processor there is a high level cash per each core. However it is small so this could explain the reason I don't have running time slowdown for small arrays. Thanks.

jutky 2010-07-16 06:17:25

Answer 4

A:

Aside from concurrency problems the most likely cause of your slowup is memory cache contention.

If all threads are accessing the same piece of storage the chances are its in the other processers memory cache when you want to access it.

If the storage is "read only" you could give each thread its own copy which would allow the JVM & processor to optimise the memory acccesses.

James Anderson 2010-07-15 08:25:28

I've added a code snippet of the test I run. As you see each thread is accessing it's own array so there shouldn't be a memory contention problem.

jutky 2010-07-15 11:58:08

Answer 5

A:

I modified your test with the advice from the article I posted. On my 2 core machine (that's all I have right now) result seems reasonable (note that I ran 2 tests for each thread number):

Maybe you can try this? (Please note that I had to modify your test slightly (see comment) because it took very long to run on my poor hardware)

Also note that I run this test using the -server option.

Test with threadNum 1 took 2095717473 ns
Test with threadNum 1 took 2121744523 ns
Test with threadNum 2 took 2489853040 ns
Test with threadNum 2 took 2465152974 ns
Test with threadNum 4 took 5044335803 ns
Test with threadNum 4 took 5041235688 ns
Test with threadNum 8 took 10279012556 ns
Test with threadNum 8 took 10347970483 ns

code:

import java.util.concurrent.*;

public class Test{
    private final static int arrSize = 20000000;

    public static void main(String[] args) throws Exception {
        int[] nums = {1,1,2,2,4,4,8,8};//allow hotspot optimization
        for (int threadNum : nums) {
            final CyclicBarrier gate = new CyclicBarrier(threadNum+1);
            final CountDownLatch latch = new CountDownLatch(threadNum);
            ExecutorService exec = Executors.newFixedThreadPool(threadNum);
            for(int i=0; i<threadNum; i++){
                Runnable test = 
                  new Runnable(){
                     public void run() {
                         try{
                             gate.await();
                         }catch(Exception e){
                             throw new RuntimeException(e);
                         }
                         int array[] = new int[arrSize];
                         //arrSize * 10 took very long to run so made it
                         // just arrSize.
                         for (long i = 0; i < arrSize; i++) {
                             array[(int) ((i * i) % arrSize)]++;
                         }//for
                         long sum = 0;
                         for (int i = 0; i < arrSize; i++){
                              sum += array[i]; 
                         }
                         if(new Object().hashCode()==sum){
                              System.out.println("oh");
                         }//if
                         latch.countDown();
                      }//run
                   };//test
                exec.execute(test);
             }//for
             gate.await();
             long start = System.nanoTime();
             latch.await();
             long finish = System.nanoTime();
             System.out.println("Test with threadNum " +
                 threadNum +" took " + (finish-start) + " ns ");
             exec.shutdown();
             exec.awaitTermination(Long.MAX_VALUE,TimeUnit.SECONDS);           
        }//for
    }//main

}//Test

Enno Shioji 2010-07-15 13:19:27

Answer 6

A:

The bottleneck in the test is the cpu to memory bandwith. Even when local memory is available, it is going to be shared by some number of threads. (The memory is local to a node, not to a specific core.) Once CPU can easily exceed the available bandwidth for a simple loop like your above test, and so increasing threads on such a test will not improve performance, and can worsen performance due to worsened cache coherence.

Just a sanity test, are you also using the parallel collector? -XX:+UseParallelGC. UseNUMA takes effect only then.

mdma 2010-07-15 13:35:16

No, I use `-XX:+UseConcMarkSweepGC` , I'll try parallel GC, thanks for the advice.

jutky 2010-07-16 06:20:40

Thanks a lot. This really improved the running times greatly. Now for 4 threads it takes 10% more time, and for 8 threads it takes 25% more time.

jutky 2010-07-18 11:40:17

ansaurus

tags:

views:

answers:

Memory access by multiple threads

related questions