On my laptop with an Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but every time I got the same result. I am using pthreads. I also tried the same with only two threads (instead of the 3 threads in the code), but the performance was similar.

What could be the reason? The faster version took reasonably long - about 20 secs - so it seems it is not an issue of startup overhead.

NOTE: This code is quite buggy (indeed, it does not make much sense, as the outputs of the serial and parallel versions would differ). The intention was just to "get" a speedup comparison for the same number of instructions.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

class Thread{
    private:
            pthread_t thread;
            static void *thread_func(void *d){((Thread *)d)->run(); return NULL;}
    public:
            Thread(){}
            virtual ~Thread(){}

            virtual void run(){}
            int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
            int wait(){return pthread_join(thread, NULL);}
};


#include <iostream>

const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];

int main(void)
{

    class Thread_a:public Thread{
            public:
                    Thread_a(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };
    class Thread_b:public Thread{
            public:
                    Thread_b(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };

    class Thread_c:public Thread{
            public:
                    Thread_c(int* a): arr_(a) {}
                    void run()
                    {
                            for(int n = 0; n<N; n++)
                            for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
                    }
            private:
                    int* arr_;
    };

    {
            Thread *a=new Thread_a(arr);
            Thread *b=new Thread_b(arr);
            Thread *c=new Thread_c(arr);

            clock_t start = clock();

            if (a->start() != 0) {
                    return 1;
            }

            if (b->start() != 0) {
                    return 1;
            }
            if (c->start() != 0) {
                    return 1;
            }

            if (a->wait() != 0) {
                    return 1;
            }

            if (b->wait() != 0) {
                    return 1;
            }

            if (c->wait() != 0) {
                    return 1;
            }

            clock_t end = clock();
            double duration = (double)(end - start) / CLOCKS_PER_SEC;

            std::cout << duration << "seconds\n";
            delete a;
            delete b;

    }
    {
            clock_t start = clock();
            for(int n = 0; n<N; n++)
            for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
            clock_t end = clock();
            double duration = (double)(end - start) / CLOCKS_PER_SEC;

            std::cout << "serial: " << duration << "seconds\n";
    }

    return 0;
}


See also: http://stackoverflow.com/questions/612860/what-can-make-a-program-run-slower-when-using-more-threads

+6  A: 

The only thing your threads do is add some elements, so your application should be IO-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster; instead, you'll have cache misses etc.

tstenner
I don't get the I/O bound thing. The only I/O is writing out the timing information. Do you mean memory bound? That, I agree with.
tvanfosson
I'm treating memory as IO here, as you don't have any major calculations.
tstenner
A: 

Some useful discussion can be found at: http://stackoverflow.com/questions/503551/does-it-make-sense-to-spawn-more-than-one-thread-per-processor


dmckee
dual core = 2 CPUs
Amit Kumar
@Amit Kumar: And you appear to spawn three threads, no? And (3 > 2), right?
dmckee
not forgetting the original thread which waits on the three, making four.
Pete Kirkham
Fundamentally I agree. However, the original thread just waits, so it does not add much to the overhead. The results were the same when 2 threads instead of 3 were used (as described in the question). The code is bandwidth-limited, but context switching is not a major overhead in the current situation.
Amit Kumar
A: 

Threads take you to the promised land of speed boosts(TM) when you have a properly parallelized implementation. Which means that you need to have:

  • a proper parallelization of your algorithm
  • a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
  • hardware support for parallelization

It is difficult to come up with the first. You need to be able to have redundancy and make sure that it's not eating into your performance, proper merging of data for processing the next batch of data, and so on...

But this is then only a theoretical standpoint.

Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember -- there is only one processor, so your threads have to wait for a time slice and essentially you are doing sequential processing.

dirkgently
One algorithm? I assume you meant one processor.
jalf
Yup. A typo. Thanks.
dirkgently
> Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Dual core = 2 CPUs.
Amit Kumar
That was a general comment. Your algorithm isn't parallelized. What else have you done to take advantage of the fact that you have two cores?
dirkgently
A: 

As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.

wekempf
+3  A: 

I believe that your algorithm essentially makes your cache memory useless.

Probably what you are seeing is the effect of (non-)locality of reference between the three threads. Essentially, because each thread is operating on a different section of data that is widely separated from the others, you are causing cache misses as the data section for one thread replaces that for another thread in your cache. If your program was constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in cache) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is, I suspect that your slowdown is because a lot of memory references are having to be satisfied from main memory instead of from your cache.
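As a rough sketch of the "smaller sections" idea (the slice size and the run_slice helper below are hypothetical, sized on the assumption of roughly 1MB of L2 cache; they are not from the question's code):

/* Hypothetical sketch: each thread repeatedly processes a slice small
   enough to stay resident in cache, so only the first pass pays for
   main-memory reads. */
const int SLICE = 128 * 1024;    /* 512KB of ints; assumes ~1MB L2 cache */

void run_slice(int *arr, int lo) /* lo = this thread's starting offset, lo >= 1 */
{
    for (int n = 0; n < 20; n++)             /* N = 20 passes, as in the question */
        for (int i = lo; i < lo + SLICE; i++)
            arr[i] += arr[i - 1];
}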

tvanfosson
+1  A: 

Not related to your threading issues, but there is a bounds error in your code. You have:

for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}

When i is zero you will be doing

arr[0] += arr[-1];
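
One minimal fix (a sketch; other conventions are possible) is to leave the first element alone and start the loop at 1:

/* arr[0] has no left neighbour, so leave it unchanged and start at 1 */
for(int i=1; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}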
GrahamS
You do not control concurrent writes to "arr", and you forgot to delete the c thread.
lsalamon
yup, my focus in this toy code was just comparing the performance.
Amit Kumar
yup, it is better to evaluate correct code.
lsalamon
+14  A: 

The times you are reporting are measured using the clock function:

The clock() function returns an approximation of processor time used by the program.

$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds

real    0m5.247s
user    0m9.025s
sys 0m0.304s

The real time will be less for multiprocessor tasks, but the processor time will typically be greater.

When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.

It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.

Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:

$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds

real    0m5.326s
user    0m9.057s
sys 0m0.344s

which is still less of a speed-up than one might hope for, but at least the numbers make some sense.

100000000*20/2.5s = 800MHz worth of element updates per second, and the bus frequency is 1600 MHz, so I suspect that, with a read and a write for each iteration (assuming some caching), you're memory-bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (Does anyone know whether clock() time includes such stalls?)
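
For reference, here is a minimal sketch of how such a wall-clock measurement could look (the exact code used for the numbers above is not shown; this assumes POSIX clock_gettime() with CLOCK_MONOTONIC, and older glibc needs -lrt):

#include <time.h>
#include <stdio.h>

/* Elapsed wall-clock seconds between two timespecs. */
static double elapsed(const struct timespec *t0, const struct timespec *t1)
{
    return (double)(t1->tv_sec - t0->tv_sec)
         + (t1->tv_nsec - t0->tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);  /* wall clock, unaffected by thread count */
    /* ... run the threaded work here ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("%.3f seconds\n", elapsed(&t0, &t1));
    return 0;
}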

Pete Kirkham
> clock() value shows that most of the time some of your processors are waiting for data. My guess is that the processor is busy with no-ops while waiting for data. This shows up in the return value of clock(). But just a guess.
Amit Kumar
A: 

tstenner has got it mostly right.

This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation allocates about 400MB of virtual memory (100000000 4-byte ints); the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.

Your benchmark also stresses the memory bus (a minimum of 400MB transferred; on OSs that zero memory just before they give it to you, the worst case is around 400MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.

You have 3 threads that are trampling all over the same memory. The cache lines are being read and written by different threads, so they will be ping-ponging between the L1 caches on the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache attached to the CPU core that is doing the write.) This is not very efficient. The CPU cores are probably spending most of their time waiting for the cache line to be transferred, which is why this is slower with threads than if you single-threaded it.

Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.
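
As an aside, when threads really do share adjacent data, the usual cure for this ping-ponging is to pad each thread's data out to a full cache line; a minimal sketch, assuming 64-byte lines:

/* Pad per-thread data to an (assumed) 64-byte cache line so that two
   threads never write to the same line. */
#define CACHE_LINE 64
struct per_thread {
    int value;
    char pad[CACHE_LINE - sizeof(int)];
};
struct per_thread counters[3];   /* one element per thread; no line is shared */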

user9876
> You have 3 threads that are trampling all over the same memory. I am not sure whether this is really the case.
Amit Kumar
A: 

Also see Herb Sutter's article on how multiple CPUs and cache lines interfere in multithreaded code, especially the section `All Sharing Is Bad -- Even of "Unshared" Objects...'

volatilevoid