ansaurus

Question

Negative Speedup on Multithreading my Program

Answer 1

+6 A:

The only thing your thread does is adding some elements, so your application should be IO-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster, instead, you'll have cache misses etc.

tstenner 2009-03-25 17:47:06

I don't get the I/O bound thing. The only I/O is writing out the timing information. Do you mean memory bound? That, I agree with.

tvanfosson 2009-03-25 17:53:38

I'm treating memory as IO here, as you don't have any major calculations.

tstenner 2009-03-25 18:10:40

Answer 2

A:

Some useful discussion can be found at: http://stackoverflow.com/questions/503551/does-it-make-sense-to-spawn-more-than-one-thread-per-processor

And possible at:

http://stackoverflow.com/questions/455084/what-kinds-of-applications-need-to-be-multi-threaded

dmckee 2009-03-25 17:48:34

dual core = 2 CPUs

Amit Kumar 2009-03-25 18:35:19

@Amit Kumar: And you appear to spawn three processes, no? And (3 > 2), right?

dmckee 2009-03-25 18:48:12

not forgetting the original thread which waits on the three, making four.

Pete Kirkham 2009-03-25 19:00:11

Fundamentally I agree. However, the original thread just waits, so it does not add much to the overhead. The results were same when 2 threads instead of 3 were used (as described in the question). The code is bandwidth limited, but context switching is not a major overhead in the current situation.

Amit Kumar 2009-03-26 15:54:42

Answer 3

A:

Threads take you to the promised land of speed boosts(TM) when you have a proper vector implementation. Which means that you need to have:

a proper parallelization of your algorithm
a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
hardware support for parallelization

It is difficult to come up with the first. You need to be able to have redundancy and make sure that it's not eating in your performance, proper merging of data for processing the next batch of data and so on ...

But this is then only a theoretical standpoint.

Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember -- there is only one processor, so your threads have to wait for a time slice and essentially you are doing sequential processing.

dirkgently 2009-03-25 17:48:46

One algorithm? I assume you meant one processor.

jalf 2009-03-25 18:30:33

Yup. A typo. Thanks.

dirkgently 2009-03-25 18:32:47

> Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Dual core = 2 CPUs.

Amit Kumar 2009-03-25 18:32:56

That was a general comment. Your algorithm isn't parallelized. What else have you done to take advantage of the fact that you have two cores?

dirkgently 2009-03-25 18:37:23

Answer 4

A:

As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.

wekempf 2009-03-25 17:50:23

Answer 5

+3 A:

I believe that your algorithm essentially makes your cache memory useless.

Probably what you are seeing is the effect of (non)locality of reference between the three threads. Essentially because each thread is operating on a different section of data that is widely separated from the others you are causing cache misses as the data section for one thread replaces that for another thread in your cache. If your program was constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in memory) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is I suspect that your slow down is because a lot of memory references are having to be satisifed from main memory instead of from your cache.

tvanfosson 2009-03-25 17:51:07

Answer 6

+1 A:

Not related to your threading issues, but there is a bounds error in your code. You have:

for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}

When i is zero you will be doing

arr[0] += arr[-1];

GrahamS 2009-03-25 17:51:38

do not control concurrence for writing at "arr", and forgotten delete c thread.

lsalamon 2009-03-25 17:58:38

yup, my focus in this toy code was just comparing the performance.

Amit Kumar 2009-03-25 18:20:35

yup, correct code is better to evaluate.

lsalamon 2009-03-25 18:36:13

Answer 7

+14 A:

The times you are reporting are measured using the clock function:

The clock() function returns an approximation of processor time used by the program.

$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds

real    0m5.247s
user    0m9.025s
sys 0m0.304s

The real time will be less for multiprocessor tasks, but the processor time will typically be greater.

When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.

It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.

Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:

$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds

real    0m5.326s
user    0m9.057s
sys 0m0.344s

which still is less of a speed-up than one might hope for, but at least the numbers make some sense.

100000000*20/2.5s = 800Hz, the bus frequency is 1600 MHz, so I suspect with a read and a write for each iteration (assuming some caching), you're memory bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (does anyone know whether clock() time includes such stalls?)

Pete Kirkham 2009-03-25 17:59:56

> clock() value shows that most of the time some of your processors are waiting for data. My guess is that the processor is busy with no-ops while waiting for data. This shows up in the return value of clock(). But just a guess.

Amit Kumar 2009-03-25 19:28:16

Answer 8

A:

tstenner has got it mostly right.

This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation allocates 800MB of virtual memory; the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.

Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.

You have 3 threads that are trampling all over the same memory. The cache lines are being read and written to by different threads, so will be ping-ponging between the L1 caches on the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache that is attached to the CPU code that's doing the write). This is not very efficient. The CPU cores are probably spending most of their time waiting for the cache line to be transferred, which is why this is slower with threads than if you single-threaded it.

Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.

user9876 2009-03-25 18:04:45

Amit Kumar 2009-03-25 18:22:21

> You have 3 threads that are trampling all over the same memory. I am not sure whether this is really the case.

Amit Kumar 2009-03-25 18:24:26

Answer 9

A:

Also see herb's article on how multi cpu and cache lines interference in multithreaded code specially the section `All Sharing Is Bad -- Even of "Unshared" Objects...'

volatilevoid 2009-04-01 11:07:19

ansaurus

tags:

views:

answers:

Negative Speedup on Multithreading my Program

related questions