I just wrote my first OpenMP program, which parallelizes a simple for loop. I ran the code on my dual-core machine and saw some speed-up when going from 1 thread to 2 threads. However, I ran the same code on a school Linux server and saw no speed-up. After trying different things, I finally realized that removing some useless printf statements gave the code a significant speed-up. Below is the main part of the code that I parallelized:

#pragma omp parallel for private(i)
for(i = 2; i <= n; i++)
{
  printf("useless statement");
  prime[i-2] = is_prime(i);
}

I guess that the implementation of printf has significant overhead that OpenMP must be duplicating with each thread. What causes this overhead and why can OpenMP not overcome it?

+1  A: 

Speculating, but maybe the stdout is guarded by a lock?

In general, printf is an expensive operation because it interacts with other resources (such as files, the console and such).

My empirical experience is that printf is very slow on a Windows console, comparatively much faster on a Linux console, and fastest of all when redirected to a file or /dev/null.

I've found that printf-debugging can seriously impact the performance of my apps, and I use it sparingly.
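One common way to keep debugging prints out of timed runs is to hide them behind a macro that compiles to nothing by default. The following is only a minimal sketch: TRACE, DEBUG_TRACE and mark_primes are made-up names, and is_prime here is a simple trial-division stand-in, since the question does not show the original.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical macro: prints to stderr only when the program is built
   with -DDEBUG_TRACE; otherwise it expands to a no-op. */
#ifdef DEBUG_TRACE
#define TRACE(...) fprintf(stderr, __VA_ARGS__)
#else
#define TRACE(...) ((void)0)
#endif

/* Stand-in for the question's is_prime(): plain trial division. */
static int is_prime(int n)
{
    if (n < 2)
        return 0;
    for (int d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Fills prime[0..n-2] with 0/1 flags for 2..n, like the loop in the question. */
static void mark_primes(int n, int *prime)
{
    assert(n >= 2 && prime != NULL);
    #pragma omp parallel for
    for (int i = 2; i <= n; i++) {
        TRACE("testing %d\n", i);   /* vanishes unless DEBUG_TRACE is defined */
        prime[i - 2] = is_prime(i);
    }
}
```

Built normally, the loop body contains no I/O at all; adding -DDEBUG_TRACE to the compiler flags brings the prints back when they are needed.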

Try running your application redirected to a file or to /dev/null to see if this has any appreciable impact; this will help narrow down where the problem lies.

Of course, if the printfs are useless, why are they in the loop at all?

Will
The printfs were there for debugging purposes.
t2k32316
+1  A: 

To expand a bit on @Will's answer ...

I don't know whether stdout is guarded by a lock, but I'm pretty sure that writing to it is serialised at some point in the software stack. With the printf statements included OP is probably timing the execution of a lot of serial writes to stdout, not the parallelised execution of the loop.
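For what it's worth, POSIX specifies that stdio calls on the same stream behave as if they take a per-stream lock, so concurrent printf calls on stdout would be expected to run one at a time there. One way around that is to keep all I/O out of the parallel loop and report afterwards. A rough sketch, assuming a trial-division is_prime and two made-up helpers, fill_flags and report_primes:

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the question's is_prime(): plain trial division. */
static int is_prime(int n)
{
    if (n < 2)
        return 0;
    for (int d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Parallel part: no I/O inside the loop, so threads never wait on stdout. */
static void fill_flags(int n, int *prime)
{
    assert(n >= 2 && prime != NULL);
    #pragma omp parallel for
    for (int i = 2; i <= n; i++)
        prime[i - 2] = is_prime(i);
}

/* Serial part: print the primes after the loop; returns how many were found. */
static int report_primes(int n, const int *prime, FILE *out)
{
    int count = 0;
    for (int i = 2; i <= n; i++) {
        if (prime[i - 2]) {
            fprintf(out, "%d\n", i);
            count++;
        }
    }
    return count;
}
```

Timing only the fill_flags call should then measure the parallel work rather than the writes.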

I suggest OP modify the printf statement to include i and see what happens.
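Something along these lines would do for that experiment. It is just a sketch: current_thread and trace_flags are made-up names, is_prime is a trial-division stand-in, and the thread number comes from omp_get_thread_num() when the code is built with OpenMP enabled (otherwise it is reported as 0).

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Thread number inside a parallel region; 0 when built without OpenMP. */
static int current_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Stand-in for the question's is_prime(): plain trial division. */
static int is_prime(int n)
{
    if (n < 2)
        return 0;
    for (int d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Same loop as in the question, but each print names the thread and i. */
static void trace_flags(int n, int *prime)
{
    assert(n >= 2 && prime != NULL);
    #pragma omp parallel for
    for (int i = 2; i <= n; i++) {
        printf("thread %d: i = %d\n", current_thread(), i);
        prime[i - 2] = is_prime(i);
    }
}
```

The order of the printed lines can change from run to run; seeing which thread handled which values of i at least shows whether the iterations are really being split between threads.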

As for the apparent speed-up on the dual-core machine -- was it statistically significant?

High Performance Mark
The apparent speed-up was not shown in a statistically strong way. I just ran it again and had several runs take about 9 seconds with 1 thread and 7 seconds with 2 threads. But when I ran it on my school's servers, there was no apparent speed-up at all (until I got rid of the printf statements). I guess my main question right now is just one of curiosity: what is going on behind the scenes with printf that would hinder OpenMP from parallelizing a for loop? I suppose it's simply like you said: it has to be serialized eventually...
t2k32316