views:

499

answers:

6

So I realize this question sounds stupid (and yes, I am using a dual-core machine), but I have tried two different libraries (Grand Central Dispatch and OpenMP), and when I use clock() to time the code with and without the lines that make it parallel, the speed is the same. (For the record, they were both using their own form of parallel for.) They report being run on different threads, but perhaps they are running on the same core? Is there any way to check? (Both libraries are for C; I'm uncomfortable at lower layers.) This is super weird. Any ideas?

+8  A: 

Most likely your execution time isn't bound by those loops you parallelized.

My suggestion is that you profile your code to see what is taking most of the time. Most engineers will tell you that you should do this before doing anything drastic to optimize things.

T.E.D.
+2  A: 

It's hard to guess without any details. Maybe your application isn't even CPU bound. Did you watch CPU load while your code was running? Did it hit 100% on at least one core?

sfussenegger
A: 

If you are using a lot of memory inside the loop, that might prevent it from being faster. Also, you could look into the pthreads library to handle the threading manually.
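
For illustration, a minimal sketch of what a hand-rolled pthreads version of a parallel loop could look like (a toy workload, not code from this answer; compile with -pthread):

#include <pthread.h>
#include <math.h>
#include <stdio.h>

#define N        100000000
#define NTHREADS 2

/* Each worker handles one contiguous slice of the iteration range. */
static void *worker(void *arg) {
    long id    = (long)arg;
    long begin = id * (N / NTHREADS);
    long end   = (id == NTHREADS - 1) ? N : begin + N / NTHREADS;
    double sum = 0.0;
    for (long i = begin; i < end; i++) sum += cos(i);
    printf("thread %ld did [%ld, %ld), sum = %f\n", id, begin, end, sum);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    /* Launch the workers, then wait for them all to finish. */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    return 0;
}

The payoff is finer control over how the work is split and synchronized, at the cost of doing by hand what OpenMP's parallel for does for you.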

Tuomas Pelkonen
A: 

I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on (if you don't specify num_threads, the default is typically one thread per core). For example:

printf("Computing bla %d on core %d/%d ...\n",i+1,omp_get_thread_num()+1,omp_get_max_threads());

The above will work with this pragma: #pragma omp parallel for default(none) shared(a,b,c)

This way you can be sure that it's running on both cores since only 2 threads will be created.
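
For illustration, a self-contained sketch along those lines (the arrays a, b and c here are hypothetical stand-ins for whatever your loop actually reads and writes):

#include <omp.h>
#include <stdio.h>

int main(void) {
    int i, n = 8;
    double a[8], b[8], c[8];
    for (i = 0; i < n; i++) { b[i] = i; c[i] = 2 * i; }

    /* default(none) forces every shared variable to be listed explicitly */
    #pragma omp parallel for default(none) shared(a, b, c, n) private(i)
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
        printf("Computing element %d on thread %d/%d ...\n",
               i + 1, omp_get_thread_num() + 1, omp_get_max_threads());
    }
    return 0;
}

On a dual-core machine with no num_threads clause you should see only thread numbers 1 and 2 in the output.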

By the way, is OpenMP enabled when you're compiling? In Visual Studio you have to enable it in the Property Pages (C/C++ -> Language) and set OpenMP Support to Yes.

Jacob
This shows which thread is doing the work, not which core; my answer below gives details.
Ramashalanka
You are correct.
Jacob
+1  A: 

Your question is missing some very crucial details such as what the nature of your application is, what portion of it are you trying to improve, profiling results (if any), etc...

Having said that you should remember several critical points when approaching a performance improvement effort:

  • Efforts should always concentrate on the code areas which have been proven, by profiling, to be the inefficient ones.
  • Parallelizing CPU-bound code will almost never improve performance on a single-core machine; you lose precious time on unnecessary context switches and gain nothing, and you can very easily make performance worse.
  • Even if you are parallelizing CPU-bound code on a multicore machine, remember that you never have any guarantee of parallel execution.

Make sure you are not going against these points, because an educated guess (barring any additional details) will say that's exactly what you're doing.

Yuval A
+13  A: 

EDIT: Added detail for Grand Central Dispatch in response to OP comment.

While the other answers here are useful in general, the specific answer to your question is that you shouldn't be using clock() to compare the timing. clock() measures CPU time, which is added up across all of the threads. When you split a job between cores, it uses at least as much CPU time (usually a bit more, due to threading overhead). Search for clock() on this page to find "If process is multi-threaded, cpu time consumed by all individual threads of process are added."

It's just that the job is split between threads, so the overall time you have to wait is less. You should be using the wall time (the time on a wall clock). OpenMP provides a routine omp_get_wtime() to do it. Take the following routine as an example:

#include <omp.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int i, nthreads;
    clock_t clock_timer;
    double wall_timer;
    for (nthreads = 1; nthreads <=8; nthreads++) {
        clock_timer = clock();
        wall_timer = omp_get_wtime();
        #pragma omp parallel for private(i) num_threads(nthreads)
        for (i = 0; i < 100000000; i++) cos(i); /* result unused: build with -O0 so the loop isn't optimized away */
        printf("%d threads: time on clock() = %.3f, on wall = %.3f\n", \
            nthreads, \
            (double) (clock() - clock_timer) / CLOCKS_PER_SEC, \
            omp_get_wtime() - wall_timer);
    }
}

The results are:

1 threads: time on clock() = 0.258, on wall = 0.258
2 threads: time on clock() = 0.256, on wall = 0.129
3 threads: time on clock() = 0.255, on wall = 0.086
4 threads: time on clock() = 0.257, on wall = 0.065
5 threads: time on clock() = 0.255, on wall = 0.051
6 threads: time on clock() = 0.257, on wall = 0.044
7 threads: time on clock() = 0.255, on wall = 0.037
8 threads: time on clock() = 0.256, on wall = 0.033

You can see that the clock() time doesn't change much. I get 0.254 without the pragma, so it's a little slower using OpenMP with one thread than not using OpenMP at all, but the wall time decreases with each thread.

The improvement won't always be this good due to, for example, parts of your calculation that aren't parallel (see Amdahl's law) or different threads fighting over the same memory.

EDIT: For Grand Central Dispatch, the GCD reference states that GCD uses gettimeofday for wall time. So, I created a new Cocoa app, and in applicationDidFinishLaunching I put:

// gettimeofday() needs #include <sys/time.h> at the top of the file
struct timeval t1,t2;
dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
for (int iterations = 1; iterations <= 8; iterations++) {
    int stride = 1e8/iterations;
    gettimeofday(&t1,0);
    dispatch_apply(iterations, queue, ^(size_t i) { 
        for (int j = 0; j < stride; j++) cos(j); 
    });
    gettimeofday(&t2,0);
    NSLog(@"%d iterations: on wall = %.3f\n",iterations, \
                t2.tv_sec+t2.tv_usec/1e6-(t1.tv_sec+t1.tv_usec/1e6));
}

and I get the following results on the console:

2010-03-10 17:33:43.022 GCDClock[39741:a0f] 1 iterations: on wall = 0.254
2010-03-10 17:33:43.151 GCDClock[39741:a0f] 2 iterations: on wall = 0.127
2010-03-10 17:33:43.236 GCDClock[39741:a0f] 3 iterations: on wall = 0.085
2010-03-10 17:33:43.301 GCDClock[39741:a0f] 4 iterations: on wall = 0.064
2010-03-10 17:33:43.352 GCDClock[39741:a0f] 5 iterations: on wall = 0.051
2010-03-10 17:33:43.395 GCDClock[39741:a0f] 6 iterations: on wall = 0.043
2010-03-10 17:33:43.433 GCDClock[39741:a0f] 7 iterations: on wall = 0.038
2010-03-10 17:33:43.468 GCDClock[39741:a0f] 8 iterations: on wall = 0.034

which is about the same as I was getting above.

This is a very contrived example. In fact, you need to be sure to keep the optimization at -O0, or else the compiler will realize we don't keep any of the calculations and will not do the loop at all. Also, the integer that I'm taking the cos of is different in the two examples, but that doesn't affect the results too much. See the discussion of STRIDE on the manpage for dispatch_apply for how to do it properly, and for why iterations is broadly comparable to num_threads in this case.
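
To give a feel for it, here is a sketch of that striding idea (along the lines of the manpage example; the COUNT and STRIDE values are arbitrary): each block handles STRIDE consecutive iterations, and a short serial loop picks up the remainder.

#include <dispatch/dispatch.h>
#include <stdio.h>

#define COUNT  20
#define STRIDE 3

int main(void) {
    dispatch_queue_t queue =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    /* Each dispatched block processes STRIDE consecutive iterations. */
    dispatch_apply(COUNT / STRIDE, queue, ^(size_t idx) {
        size_t j = idx * STRIDE;
        size_t j_stop = j + STRIDE;
        do {
            printf("%zu\n", j++);
        } while (j < j_stop);
    });

    /* Iterations that don't fill a whole stride are handled serially. */
    for (size_t i = COUNT - (COUNT % STRIDE); i < COUNT; i++)
        printf("%zu\n", i);

    return 0;
}

Striding amortizes the per-block dispatch overhead over several iterations, which matters when each iteration is as cheap as a single cos().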

EDIT: I note that Jacob's answer includes

I use the omp_get_thread_num() function within my parallelized loop to print out which core it's working on... This way you can be sure that it's running on both cores.

which is not correct (it has been partly fixed by an edit). Using omp_get_thread_num() is indeed a good way to ensure that your code is multithreaded, but it doesn't show "which core it's working on", just which thread. For example, the following code:

#include <omp.h>
#include <stdio.h>

int main() {
    int i;
    #pragma omp parallel for private(i) num_threads(50)
    for (i = 0; i < 50; i++) printf("%d\n", omp_get_thread_num());
}

prints out that it's using threads 0 to 49, but this doesn't show which core it's working on, since I only have eight cores. By looking at Activity Monitor (the OP mentioned GCD, so must be on a Mac; go to Window > CPU Usage), you can see jobs switching between cores, so core != thread.
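
If you really want the core rather than the thread, OpenMP itself has no portable way to report it. On Linux (not on the OP's Mac) you could, for example, print sched_getcpu() next to the thread number; a hypothetical sketch:

#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>   /* sched_getcpu(), glibc/Linux only */
#include <stdio.h>

int main(void) {
    int i;
    #pragma omp parallel for private(i) num_threads(50)
    for (i = 0; i < 50; i++)
        printf("iteration %d: thread %d on core %d\n",
               i, omp_get_thread_num(), sched_getcpu());
    return 0;
}

Even then, the scheduler is free to migrate threads between cores from one call to the next, which is exactly the core != thread point above.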

Ramashalanka
Damn good point. I usually use RDTSC for profiling, not clock().
T.E.D.
Thanks, this sounds like what I was looking for, but could you explain a little more? I was calling clock() outside of the for loop (after all the threads had finished except the calling thread), so that would mean that clock() isn't reliable for getting the time. I'm not quite sure that sounds right. I care about the time itself (in milliseconds or whatever) that it takes the code to run, without much regard for any other statistics. Also, what would the GCD version of omp_get_wtime be?
Jared P
`clock()` is reliable for getting the *CPU* time, which adds up the time used by all of the threads. GCD uses `gettimeofday` for wall time; see details and references above.
Ramashalanka
Sadly, gettimeofday doesn't always provide the resolution you need for profiling small chunks of code. That's why I tend to use RDTSC instead. Otherwise, this advice is dead-on.
T.E.D.
@T.E.D.: OK. I reran the results using the timer from @FreeMemory's answer to http://stackoverflow.com/questions/638269/timer-to-find-elapsed-time-in-a-function-call-in-c (which I note you commented on) and got consistent results (to about 3sf, or the millisecond level). Any comments on the criticisms of RDTSC for multicore at http://en.wikipedia.org/wiki/Time%5FStamp%5FCounter? As an aside, if differences are below the microsecond level provided by `gettimeofday`, profiling results tend to be very variable between runs (dependent on caching, what else my computer is doing, etc).
Ramashalanka
For nanosecond resolution on the Mac (which is what the OP is using), there is also `uint64_t mach_absolute_time()` (a usage sketch follows below), which is consistent to the microsecond with `gettimeofday` and (from my comment above) inconsistent with RDTSC below the millisecond level. This link, http://www.mactech.com/articles/mactech/Vol.22/22.07/OptimizationI/index.html (by Intel employees), discusses differences, and reaffirms my comment above about variable profile results: "... using either rdtsc or mach_absolute_time ... A sufficient runtime may be at a minimum on the order of tens or hundreds of seconds."
Ramashalanka
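
For reference, a minimal sketch of timing a loop with mach_absolute_time(): the raw ticks are converted to nanoseconds via the numer/denom ratio from mach_timebase_info() (the cos() loop is just a stand-in workload; build with -O0 so it isn't optimized away).

#include <mach/mach_time.h>
#include <stdint.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);                 /* ticks -> nanoseconds ratio */

    uint64_t t0 = mach_absolute_time();
    for (int i = 0; i < 100000000; i++) cos(i);
    uint64_t t1 = mach_absolute_time();

    double ns = (double)(t1 - t0) * tb.numer / tb.denom;
    printf("elapsed = %.3f s\n", ns / 1e9);
    return 0;
}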