I have a program that is running slower than I'd like it to.

I've done some profiling, and I've found the section that takes up the vast majority of the processing time:

        DO K = 0, K_MAX
            WRITE(EIGENVALUES_IO, *) K * 0.001 * PI, (W_UP(J), J=1, ATOM_COUNT)
            DCMPLXW_UP(:) = DCMPLX(W_UP(:))
            DO E = 1, ENERGY_STEPS
                ENERGY = MIN_ENERGY + ENERGY_STEP * REAL(E, DP)
                ZV = DCMPLX(ENERGY, DELTA)
                ON_SITE_SINGLE = DCMPLX(0.0_DP)
                DO Q = 1, ATOM_COUNT
                    DO J = 1, ATOM_COUNT
                        ON_SITE_SINGLE(J) = ON_SITE_SINGLE(J) + (MATRIX_UP(J, Q) * MATRIX_UP_CONJG(J, Q)) / (ZV - DCMPLXW_UP(Q))
                    END DO
                END DO
                DOS_DOWN(E) = DOS_DOWN(E) - WEIGHTS(K) * SUM(IMAG(ON_SITE_SINGLE))
            END DO
        END DO

The line

ON_SITE_SINGLE(J) = ON_SITE_SINGLE(J) + (MATRIX_UP(J, Q) * MATRIX_UP_CONJG(J, Q)) / (ZV - DCMPLXW_UP(Q))

is the one doing the damage.

I'm fairly new at this; is there some way of speeding this up? AFAIK the same principles apply in C, so any help from you guys would be nice too.

The arrays are all COMPLEX

K_MAX is 1000

ENERGY_STEPS is 1000

ATOM_COUNT is low ( < 50)

+1  A: 

The factors you divide by, namely

(ZV - DCMPLXW_UP(Q))

do not depend on J, only on Q. Hence, I would move this calculation up to the Q loop. Better still, calculate:

1/(ZV - DCMPLXW_UP(Q))

in the outer loop, and multiply by it instead of dividing inside the loop (division is typically much slower than multiplication); a sketch of that restructuring follows below. Also, check that your matrix data structures correspond to the loops, i.e. that the loops traverse contiguous parts of memory as much as possible. As a rule, if you can improve the algorithm, that will give the greatest run-time improvement.
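
Concretely, the Q/J loops could be restructured along these lines (a sketch only; INV_DENOM is a new complex scalar of the same kind as ZV, introduced here for illustration):

    DO Q = 1, ATOM_COUNT
        ! Hoisted: this value depends only on Q (and the enclosing E loop), not on J,
        ! so form its reciprocal once per Q and multiply by it inside the J loop.
        INV_DENOM = DCMPLX(1.0_DP, 0.0_DP) / (ZV - DCMPLXW_UP(Q))
        DO J = 1, ATOM_COUNT
            ON_SITE_SINGLE(J) = ON_SITE_SINGLE(J) + MATRIX_UP(J, Q) * MATRIX_UP_CONJG(J, Q) * INV_DENOM
        END DO
    END DO

On the memory-layout point: the J loop already runs over the first index of MATRIX_UP, which is the contiguous direction in Fortran's column-major storage, so the access pattern in the snippet shown looks fine.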

Programming Pearls has a great description of similar optimizations.

Yuval F
+1  A: 

If regular code optimization gets you stuck, you could try OpenMP, an API for shared-memory parallel programming available for C, C++ and Fortran. You insert directives into your code before loops, "pre-compiler" style, and the compiler splits the heavy loops across multiple threads.

There are several directives you might want to try. For instance:

#pragma omp parallel for
/* Loop here */
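
Since the question's code is Fortran, the same directive is written as a comment there. A sketch for the ENERGY loop (parallelising E rather than K avoids sharing the WRITE statement and the accumulation into DOS_DOWN across threads; the PRIVATE list is an assumption based only on the snippet shown):

    !$OMP PARALLEL DO PRIVATE(J, Q, ENERGY, ZV, ON_SITE_SINGLE)
    DO E = 1, ENERGY_STEPS
        ! ... loop body exactly as in the question ...
    END DO
    !$OMP END PARALLEL DO

Each iteration writes a different DOS_DOWN(E), so no reduction clause is needed for this loop; remember to compile with your compiler's OpenMP flag (e.g. -fopenmp for gfortran).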

It's a very complete API: you can control how the work is split, which variables are shared or private between threads, and which scheduling strategy is used. You can also specify the number of threads you want OpenMP to create, and so on.

With a bit of tuning you should find a configuration that increases your computation speed.

Nicolas C.
I wanted to avoid parallelising things just yet, but thanks; I didn't know it would be quite so simple to do with DO loops.
Sticky
Sure, no problem, but if you want to try it in the future, there it is! I was impressed: for big matrix computations it really improved speed compared to my regular C code.
Nicolas C.
+1  A: 

Please use BLAS for 'vector plus matrix-vector multiply' operations. That is basically what you are doing in the line

ON_SITE_SINGLE(J) = ON_SITE_SINGLE(J) + (MATRIX_UP(J, Q) * MATRIX_UP_CONJG(J, Q)) / (ZV - DCMPLXW_UP(Q)) 

With a well-tuned BLAS library, you can get a significant improvement.

Alexandre C.
I'm already using BLAS elsewhere in this program, but like I said I'm fairly new to it - any hints as to which routine I need to use?
Sticky
*GEMV; since your arrays are double-precision complex, that would be ZGEMV. Another thing to do is to precompute MATRIX_UP * MATRIX_UP_CONJG into one matrix and 1 / (ZV - DCMPLXW_UP) into one vector, then call *GEMV.
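
A minimal sketch of what that might look like, assuming a standard BLAS and double-precision complex data (M_PROD and INV_DENOM are new work arrays introduced here, of shape (ATOM_COUNT, ATOM_COUNT) and (ATOM_COUNT) respectively):

    M_PROD = MATRIX_UP * MATRIX_UP_CONJG                ! element-wise product, computed once per K
    DO E = 1, ENERGY_STEPS
        ENERGY = MIN_ENERGY + ENERGY_STEP * REAL(E, DP)
        ZV = DCMPLX(ENERGY, DELTA)
        INV_DENOM(:) = DCMPLX(1.0_DP, 0.0_DP) / (ZV - DCMPLXW_UP(:))
        ! ON_SITE_SINGLE := 1 * M_PROD * INV_DENOM + 0 * ON_SITE_SINGLE
        CALL ZGEMV('N', ATOM_COUNT, ATOM_COUNT, DCMPLX(1.0_DP), M_PROD, ATOM_COUNT, &
                   INV_DENOM, 1, DCMPLX(0.0_DP), ON_SITE_SINGLE, 1)
        DOS_DOWN(E) = DOS_DOWN(E) - WEIGHTS(K) * SUM(IMAG(ON_SITE_SINGLE))
    END DO

Even without BLAS, precomputing M_PROD once per K takes the MATRIX_UP * MATRIX_UP_CONJG product out of the innermost loop, since it does not depend on the energy.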
Alexandre C.
+5  A: 

All my programs run slower than I'd like. In all (OK, not all, but many of) my scientific programs there is a deep loop nest in which the innermost statement(s) take up most of the computation time. Typically I expect 90%+ of my computations to be taken up by those statements. That innermost statement of yours is being executed 2.5x10^9 times, so you should expect it to take a significant fraction of the total time.

Bearing this in mind I suggest that you:

a) Take @Alexandre's advice to use BLAS rather than your home-brewed matrix-vector multiplication.

b) Ignore @Yuval's advice about lifting operations out of the loop; a good Fortran compiler will do this for you if you turn optimisation up high (WARNING: this is a self-fulfilling prophecy, inasmuch as if the compiler doesn't do it, it is not a good one). There are a lot of other optimisations I expect from a good Fortran compiler these days; see (d). (I don't expect optimisation of memory access from the compiler, I expect that from BLAS.)

c) Form a realistic expectation of how much performance you should be able to get from your program. If you get a sustained FLOP rate in excess of 10% of the CPU's rated peak performance, you are doing very well and should spend your time on things other than optimisation.

d) Read your compiler documentation very carefully. Make sure that you understand what the optimisation flags actually do. Make sure that you are generating code for the CPUs you are actually running on, not some older variant. Switch in fast vector operations if they are available. All that sort of thing (an example invocation follows this list).

e) Start parallelising. OpenMP is a good place to start and, as @Nicolas indicates, the learning curve is quite gentle at first.
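
As an illustration of (d) and (e), a gfortran invocation might look like the following (assuming gfortran; the source file name is just a placeholder, and other compilers have equivalent flags):

    gfortran -O3 -march=native -funroll-loops -fopenmp dos_calc.f90 -o dos_calc

Here -march=native targets the CPU the code is compiled on, and -fopenmp enables the OpenMP directives mentioned in (e).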

Oh, and advice 0, which you seem to have followed, is to measure the code's performance and measure the impact of any changes you make.

High Performance Mark
Thanks, really appreciate it
Sticky
+1 Excellent advice (and that first sentence made me laugh :-)
Rook
Be careful to compare output between optimization levels in Fortran. Aggressive optimization by the compiler can significantly change the results of computations in loops.
Michael Shopsin