views:

637

answers:

2

Hello,

I am writing a multi-threaded application in Java in order to improve performance over the sequential version. It is a parallel version of the dynamic programming solution to the 0/1 knapsack problem. I have an Intel Core 2 Duo with both Ubuntu and Windows 7 Professional on different partitions. I am running in Ubuntu.

My problem is that the parallel version actually takes longer than the sequential version. I am thinking this may be because the threads are all being mapped to the same kernel thread or that they are being allocated to the same core. Is there a way I could ensure that each Java thread maps to a separate core?

I have read other posts about this problem but nothing seems to help.

Here is the end of main() and all of run() for the KnapsackThread class (which extends Thread). Notice that they way I use slice and extra to calculate myLowBound and myHiBound ensure that each thread will not overlap in domain of the dynProgMatrix. Therefore there will be no race conditions.

    dynProgMatrix = new int[totalItems+1][capacity+1];
    for (int w = 0; w<= capacity; w++)
        dynProgMatrix[0][w] = 0;
    for(int i=0; i<=totalItems; i++)
        dynProgMatrix[i][0] = 0;
    slice = Math.max(1,
            (int) Math.floor((double)(dynProgMatrix[0].length)/threads.length));
    extra = (dynProgMatrix[0].length) % threads.length;

    barrier = new CyclicBarrier(threads.length);
    for (int i = 0; i <  threads.length; i++){
        threads[i] = new KnapsackThread(Integer.toString(i));
    }
    for (int i = 0; i < threads.length; i++){
        threads[i].start();
    }

    for (int i = 0; i < threads.length; i++){
        try {
            threads[i].join();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

public void run(){
    int myRank = Integer.parseInt(this.getName());

    int myLowBound;
    int myHiBound;

    if (myRank < extra){
        myLowBound = myRank * (slice + 1);
        myHiBound = myLowBound + slice;
    }
    else{
        myLowBound = myRank * slice + extra;
        myHiBound = myLowBound + slice - 1;
    }

    if(myHiBound > capacity){
        myHiBound = capacity;
    }

    for(int i = 1; i <= totalItems; i++){
        for (int w = myLowBound; w <= myHiBound; w++){

            if (allItems[i].weight <= w){
               if (allItems[i].profit + dynProgMatrix[i-1][w-allItems[i].weight]
                        > dynProgMatrix[i-1][w])
                {
                    dynProgMatrix[i][w] = allItems[i].profit +
                                      dynProgMatrix[i-1][w- allItems[i].weight];
                }
                else{
                    dynProgMatrix[i][w] = dynProgMatrix[i-1][w];
                }
            }
            else{
                dynProgMatrix[i][w] = dynProgMatrix[i-1][w];
            }
        }
        // now place a barrier to sync up the threads
        try {
            barrier.await(); 
        } catch (InterruptedException ex) { 
            ex.printStackTrace();
            return;
        } catch (BrokenBarrierException ex) { 
            ex.printStackTrace(); 
            return;
        }
    }
}

Update:

I have written another version of the knapsack that uses brute force. This version has very little synchronization because I only need to update a bestSoFar variable at the end of a single thread's execution. Therefore, each thread pretty much should execute completely in parallel except for that small critical section at the end.

I ran this versus the sequential brute force and still it takes longer. I don't see any other explanation than that my threads are being run sequentially, either because they are being mapped to the same core or to the same native thread.

Does anybody have any insight?

+10  A: 

I doubt that it will be due to using the same core for all threads. The scheduling is up to the OS, but you should be able to see what's going on if you bring up the performance manager for the OS - it will typically show how busy each core is.

Possible reasons for it taking longer:

  • Lots of synchronization (either necessary or unnecessary)
  • The tasks taking such a short time that thread creation is taking a significant proportion of the time
  • Context switching, if you're creating too many threads - for CPU intensive tasks, create as many as threads as you have cores.
Jon Skeet
Great answer, as usual. But I suspect we can rule out your 2nd bullet point: Looking at the coding, apparently a small number of Threads is being created, and these run -once- for a goodly long time. Synchronization looks like a good bet.
Carl Smotricz
@Carl: We haven't really seen enough code to know. We don't know how big the `threads` array is, unless I've missed something.
Jon Skeet
(`Runtime.availablePrcoessors` will give the number of hardware threads.)
Tom Hawtin - tackline
Runtine.availableProcessors gives me the number of cores right? not threads
Kayton
As the method name already hints, it returns the number of available processors (logical processing units, CPU cores as you want), not threads.
BalusC
A: 

I suggest you take a look at how long it takes for each of your worker threads before they terminate. Perhaps one of the threads has a much more difficult task. If that's the case, then the overhead caused by synchronization and so on will easily eat up what you've gained from threading.

Buhb