I've been given a 2D matrix representing temperature points on the surface of a metal plate. The edges of the matrix (plate) are held constant at 20 degrees C and there is a constant heat source of 100 degrees C at one pre-defined point. All other grid points are initially set to 50 degrees C.

My goal is to compute the steady-state temperature of every interior grid point by iteratively averaging the four surrounding grid points ((i+1, j), (i-1, j), (i, j+1), (i, j-1)) until I reach convergence (a change of less than 0.02 degrees C between iterations).
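For concreteness, here is a minimal sketch of the iteration I have in mind (the grid size, heat-source location, and all names are just illustrative, not my actual code):

program plate_demo
   implicit none
   ! Illustrative values: n-by-n grid, heat source at (hi, hj)
   integer, parameter :: n = 50, hi = 25, hj = 25
   real :: plate(n,n), new_plate(n,n), diff
   integer :: i, j

   plate = 50.0                            ! interior starting guess
   plate(1,:) = 20.0; plate(n,:) = 20.0    ! edges held at 20 C
   plate(:,1) = 20.0; plate(:,n) = 20.0
   plate(hi,hj) = 100.0                    ! constant heat source

   diff = huge(diff)
   do while (diff > 0.02)
      new_plate = plate
      do j = 2, n-1
         do i = 2, n-1
            new_plate(i,j) = 0.25 * (plate(i+1,j) + plate(i-1,j) &
                                   + plate(i,j+1) + plate(i,j-1))
         end do
      end do
      new_plate(hi,hj) = 100.0             ! re-impose the source each sweep
      diff = maxval(abs(new_plate - plate))
      plate = new_plate
   end do

   print *, 'converged; centre value:', plate(n/2, n/2)
end program plate_demo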

As far as I know, the order in which I iterate over the grid points is irrelevant.

To me, this sounds like a fine time to invoke the Fortran FORALL construct and explore the joys of parallelization.

How can I ensure that the code is indeed being parallelized?

For example, if I compile this on my single-core PowerBook G4, I would expect no speedup from parallelization. But if I compile on a dual-core AMD Opteron, I would assume that the FORALL construct can be exploited.

Alternatively, is there a way to measure the effective parallelization of a program?

Update

In response to M. S. B.'s question, this is with gfortran version 4.4.0. Does gfortran support automatic multi-threading?

It's remarkable that the FORALL construct has been rendered obsolete by what is, I suppose, auto-vectorization.

Perhaps this is best left for a separate question, but how does auto-vectorization work? Is the compiler able to detect that only pure functions or subroutines are used in a loop?

+1  A: 

The best way is to measure the clock time of the calculation. Try it with and without parallel code. If the clock time decreases, then your parallel code is working. The Fortran intrinsic system_clock, called before and after the code block, will give you the clock time. The intrinsic cpu_time will give you the CPU time, which might go up when code is run multi-threaded, due to overhead.
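A minimal sketch of the measurement (the dummy loop is only a stand-in for your own sweep):

program timing_demo
   implicit none
   integer :: t_start, t_end, rate
   real :: cpu_start, cpu_end, x
   integer :: i

   call system_clock(t_start, rate)   ! wall-clock ticks and tick rate
   call cpu_time(cpu_start)           ! CPU seconds consumed so far

   ! ... your FORALL / loop code goes here; a dummy load for illustration:
   x = 0.0
   do i = 1, 10000000
      x = x + sin(real(i))
   end do

   call cpu_time(cpu_end)
   call system_clock(t_end)

   print *, 'wall time (s):', real(t_end - t_start) / real(rate)
   print *, 'cpu  time (s):', cpu_end - cpu_start
   print *, x                         ! keep the compiler from eliding the loop
end program timing_demo

If the wall time goes down while the CPU time stays roughly constant (or rises), the work is being spread across cores.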

The lore is that FORALL is not as useful as was thought when it was introduced into the language -- that it is more of an initialization construct. Compilers are equally adept at optimizing regular loops.
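For your problem, the FORALL form would look something like this sketch (plate and new_plate are the illustrative names from the question):

FORALL (i = 2:n-1, j = 2:n-1)
   new_plate(i,j) = 0.25 * (plate(i+1,j) + plate(i-1,j) &
                          + plate(i,j+1) + plate(i,j-1))
END FORALL

There is nothing here that a compiler can exploit any better than the equivalent nested DO loops.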

Fortran compilers vary in their abilities to implement true parallel processing without it being explicitly specified, e.g., with OpenMP or MPI. What compiler are you using?

To get automatic multi-threading, I've used ifort. Manually, I've used OpenMP. With both of these, you can compile your program with and without the parallelization and measure the difference.

M. S. B.
+1  A: 

If you use the Intel Fortran Compiler, you can use a command-line switch to turn on or increase the compiler's verbosity level for parallelization/vectorization. That way, during compilation/linking you will be shown something like:

FORALL loop at line X in file Y has been vectorized

I admit that it has been a few years since I last used it, so the compiler message might actually look very different, but that's the basic idea.

exfizik
I'll have to get my hands on ifort to see what the exact message is, but this kind of verbosity is exactly what I was looking for! Even for cases of auto-vectorization, I'd like to know which loops are being parallelized and which aren't, particularly for cases where I would assume parallelization should have been possible.
CmdrGuard
+1  A: 

FORALL is an assignment construct, not a looping construct. The semantics of FORALL state that the expression on the right-hand side (RHS) of each assignment within the FORALL is evaluated completely before it is assigned to the left-hand side (LHS). This must hold no matter how complex the operations on the RHS are, including cases where the RHS and the LHS overlap.

Most compilers punt on optimizing FORALL, both because it is difficult to optimize and because it is not commonly used. The easiest implementation is to simply allocate a temporary for the RHS, evaluate the expression and store it in the temporary, then copy the result into the LHS. Allocation and deallocation of this temporary is likely to make your code run quite slowly. It is very difficult for a compiler to automatically determine when the RHS can be evaluated without a temporary; most compilers don't make any attempt to do so. Nested DO loops turn out to be much easier to analyze and optimize.
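For example, when the LHS appears in the RHS, the two constructs mean different things (a sketch with an illustrative array a):

! FORALL: every RHS is evaluated from the *old* contents of a,
! as if the whole RHS were computed into a temporary first.
FORALL (i = 2:n-1)
   a(i) = 0.5 * (a(i-1) + a(i+1))
END FORALL

! DO: each iteration sees updates made by earlier iterations,
! so a(i-1) here is already the new value.
do i = 2, n-1
   a(i) = 0.5 * (a(i-1) + a(i+1))
end do

The FORALL version conceptually needs a copy of a -- exactly the temporary described above.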

With some compilers, you may be able to parallelize evaluation of the RHS by enclosing the FORALL in the OpenMP "workshare" directive and compiling with whatever flags are necessary to enable OpenMP, like so:

!$omp parallel workshare
FORALL (i = ..., j = ...)
   ...
END FORALL
!$omp end parallel workshare

gfortran -fopenmp blah.f90 -o blah
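The thread count can then be set at run time with the standard OMP_NUM_THREADS environment variable, which makes it easy to compare one thread against several:

OMP_NUM_THREADS=2 ./blah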

Note that a compliant OpenMP implementation (including at least older versions of gfortran) is not required to evaluate the RHS in parallel; it is acceptable for an implementation to evaluate it as though it were enclosed in an OpenMP "single" directive. Note also that the "workshare" likely will not eliminate the temporary allocated for the RHS. This was the case with an old version of the IBM Fortran compiler on Mac OS X, for instance.

Brian
Hmmm. I had never considered that the complexity of the RHS could affect the possibility of parallelization. Your point, then, is very clear regarding why compilers might punt on optimizing a FORALL loop.
CmdrGuard
