views: 150

answers: 4

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:

B[i][j] = exp(A[i][j])

where i in [0, Ny] and j in [0, Nx].

Notice that this is different from matrix exponential:

B = exp(A)

which can be accomplished with some unstable/unsupported code in GSL (linalg.h).

I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?

EDIT

Results from the solution post of Drew Hall

All results are from a 1024x1024 double loop in which each iteration assigns two double values (a complex number). The reported time is the average over 100 executions.

  • Results when taking into account the {Row,Column}-Major mode to store the matrix:
    • 226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
    • 223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
    • 224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).

Source code for case 1:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix->data[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 2:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}

Source code for case 3:

for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}
+3  A: 

No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.

Rafe Kettler
Any magical mathematician around? It really seems inefficient to have to use the double for loop.
Alejandro Cámara
Alejandro: If you want `N_x * N_y` independently calculated exponentials, then `A` is not a matrix anymore but just a bunch of numbers. :-)
kaizer.se
If you want a mathemagician, I'd go to a math forum or something of a similar nature (like www.askdrmath.com) and try their knowledge. It can't hurt to try!
Rafe Kettler
kaizer.se: You're right, it doesn't look like the double for loop can be improved on :-( BTW, I sent the question to Dr. Math as Rafe suggested, just in case some magical mathematician is online.
Alejandro Cámara
+2  A: 

If you just want to apply exp to an array of numbers, there's really no shortcut: you have to call it Nx * Ny times. If some of the matrix elements are simple (like 0), or there are repeated elements, some memoization could help.

However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.

Mike Dunlavey
+5  A: 

There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.

In particular, your goal should be to minimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).

Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (M*N bound). You'll need to get a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.

Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).

Drew Hall
Wow, that was a change in the strategy that I didn't think of before. Thanks, I'll write it down and try to apply it to the many nested fors loops that I do.
Alejandro Cámara
Sure. When you've done so, add another comment here (or edit your question) and let us know if you actually got any performance improvement!
Drew Hall
How would he determine if the matrix is stored in a row-major or column-major way? I'd love to be able to use this optimization myself. Also, Drew Hall = `for` loop superhero
Rafe Kettler
@Rafe: Generally code that has a C heritage is row-major, and code that has a Fortran heritage (or interfaces with Fortran libraries such as LAPACK) is column major. Look in the documentation for the matrix library you're using, or if all else fails, try it both ways with a really big matrix and time it--it should be pretty obvious.
Drew Hall
I edited the question to include some results I've obtained. I'll try the other optimizations later.
Alejandro Cámara
Loop unrolling is not a wise thing to do on modern compilers/CPUs. Firstly, we haven't been shown the whole of the contents of the loop, and secondly, the compiler will have a better idea of the trade-off between fewer branches (branches are slow) and increased code size (since it increases memory bandwidth).
Skizz
@Skizz: That may be good advice, but you've really got to measure. I'm assuming that the unrolled loop won't have so much code that it will blow the instruction cache, and have focused on minimizing branches (though I called it loop overhead in my post). You are right that a good optimizing compiler will do loop unrolling for you in many cases--I think his benchmark results probably show evidence of that.
Drew Hall
A: 

Since the contents of the loop (the part that calculates c_value) haven't been shown, we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.

If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hurt throughput, because data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.

If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compilers aren't great at FPU code, for many reasons). If this were me, and the code in the inner loop allowed it, I'd have two pointers into the array: one at the start and a second 4/5ths of the way through it. Each iteration would perform a SIMD operation using the first pointer and scalar FPU operations using the second pointer, so that each iteration of the loop processes five values. Then I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches, since (at least on the Pentium) the hardware can prefetch up to four data streams simultaneously (i.e. fetch data ahead of use without any prompting or special instructions).

Skizz