I am developing some scientific software for my university. It is being written in C++ on Windows (VS2008). The algorithm must calculate some values for a large number of matrix pairs; that is, at its core resides a loop iterating over the matrices, collecting some data, e.g.:

double valA, valB, sumA = 0, sumAsq = 0, sumB = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0, result = 0;
for (int y=0; y < height; ++y)
{
    for (int x=0; x < width; ++x)
    { 
        valA = matrixA(x,y);
        valB = matrixB(x,y);
        sumA+=valA;
        sumAsq+=valA*valA;
        sumB+=valB;
        sumBsq+=valB*valB;
        diffsum+=valA-valB;
        diffsumsq+=(valA-valB)*(valA-valB);
    }
}
result = sumA + sumB / sumAsq + sumBsq * diffsum * diffsumsq; // value returned by the routine

This routine is executed millions of times for different matrixA, matrixB pairs. My problem is that this program is extremely slow, compiled in Release mode with all optimizations activated. Using the "pause-when-busy-and-inspect" debugger technique, I established that the program sits inside this loop virtually every time, even though, as you might expect, this routine is surrounded by a whole bunch of conditions and control branches. What puzzles me the most is that during its execution on a dual-processor Xeon-based system, the program utilizes one of the 4 cores (no surprise, it is single-threaded for now) but only up to about 25% of its limit, and with relatively large oscillations, where I would expect steady, 100% load until the program terminates.

The current version is actually a rewrite, created with performance optimization in mind. I was devastated to find out that it's actually slower than the original. The previous version used Boost matrices, which I replaced with OpenCV matrices after establishing OpenCV to be over 10 times faster when comparing the execution time of multiplying two 1000x100 matrices. I access the matrix data by manually dereferencing a raw pointer, which I hoped would gain me some performance. I made the calculation routine a multi-line #define macro to enforce its inlining and to avoid function calls and returns. I improved the math behind the calculations so that the final value is calculated in a single pass through the matrices (the old version required two passes). I expected huge gains, and yet the opposite is true. I'm nowhere near my old program's efficiency, not to mention that of commercial software for this particular application.

I was wondering if it perhaps has something to do with the matrix data being 8-bit chars; I once saw that access to floats was actually slower than access to doubles in my old program, and perhaps chars are even slower, since the processor retrieves data in 32-bit chunks (this Xeon probably grabs even 64 bits). I also considered turning the matrices into vectors to avoid the loop-inside-loop construct, as well as some form of vectorization, for example calculating the data for 4 (fewer? more?) consecutive matrix cells in a single loop iteration. Any other ideas, please?

EDIT: actual code in the new, OpenCV based version:

const char *Aptr, *Bptr;
double sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0;
char Aval, Bval;

for (int y=0; y < height; ++y)
{
    Aptr = (char*)(AMatrix.imageData + AMatrix.widthStep * y);
    Bptr = (char*)(BMatrix.imageData + BMatrix.widthStep * y);
    for (int x=0; x < width; ++x)
    {
        Aval = Aptr[x];
        Bval = Bptr[x];

        sumA+=Aval;
        sumB+=Bval;
        sumAsq+=Aval*Aval;
        sumBsq+=Bval*Bval;
        diffsum+=Aval-Bval;
        diffsumsq+=(Aval-Bval)*(Aval-Bval);
    }
}
+2  A: 

Your inner loop is calling functions! No matter how trivial they are, you pay a heavy penalty. You should try to linearize the matrix accesses (in essence, make them 1D) so that you can access them with just pointer dereferencing:

vala = *matrixA++; 
valb = *matrixB++; 
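
For instance, a minimal sketch of the fully linearized loop (the data pointer names are hypothetical, and it assumes the matrix data is contiguous with no row padding, which OpenCV's widthStep layout does not guarantee):

const char *pA = matrixAdata; // hypothetical pointers to contiguous matrix data
const char *pB = matrixBdata;
for (int i = 0; i < width * height; ++i)
{
    int valA = *pA++;
    int valB = *pB++;
    sumA += valA;
    sumAsq += valA * valA;
    sumB += valB;
    sumBsq += valB * valB;
    int diff = valA - valB;
    diffsum += diff;
    diffsumsq += diff * diff;
}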

Since you are doing simple additions and subtractions, look at SSE/SSE2 etc., depending on your target CPU's capabilities and your arithmetic (integer, floating point, etc.).

EDIT: MMX/SSE2 intrinsics are functions that map one-to-one to CPU SIMD instructions. See these Microsoft pages to get started; additionally, I suggest looking at the Intel site for the IA-32/Intel 64 programmer's guides, or similar manuals from AMD.
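
To illustrate the one-to-one mapping, here is a minimal sketch (not the OP's routine) that sums an array of unsigned bytes with SSE2, assuming n is a multiple of 16:

#include <emmintrin.h> // SSE2 intrinsics

unsigned int sum_bytes_sse2(const unsigned char *p, int n)
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = _mm_setzero_si128();
    for (int i = 0; i < n; i += 16)
    {
        // load 16 bytes; PSADBW against zero yields their sum in two 64-bit lanes
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    // combine the two 64-bit partial sums
    return (unsigned int)(_mm_cvtsi128_si32(acc) +
                          _mm_cvtsi128_si32(_mm_srli_si128(acc, 8)));
}

Each intrinsic compiles to essentially a single instruction (MOVDQU, PSADBW, PADDQ), so you get the speed of assembly while the compiler handles register allocation.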

I also highly recommend this book on Optimization for Intel Architectures. It will explain all the hidden capabilities of your CPU and compiler.

renick
They will very likely be inlined...
doc
Inlining doesn't matter if they do pointer arithmetic in every call (or heaven forbid bounds checking). The linearization is the important hint here
renick
As I mentioned, in the new version, the matrix data is accessed by a pointer, i.e. I have char * valAptr = (char*)(matrixAdata + matrixAstep * y); in the outside loop and valA = valAptr[x]; in the inside one. I don't quite get the bit about SSE, I activated "Enable Enhanced Instruction Set" and set it to /arch:SSE2 in the solution options, will this do?
neuviemeporte
@neuviemeporte: This will at least allow the compiler to use the SSE unit instead of the FPU (see my answer for details), but the compiler typically won't be clever enough to auto-vectorize your code. If you can vectorize your code using SSE intrinsics, there's a lot of potential for performance improvement.
Martin B
That's what I was saying about pointer arithmetic: get rid of the multiplications! You can do this by allocating the Matrix(NxM) as a vector(N*M). If you need to use it as a matrix, you can cast it as such, but in this loop use it as a vector. This will remove one loop. Then look at SSE intrinsics to do more in every cycle.
renick
How do I go about using SSE? I did some assembler programming, but mostly on DOS many years ago, also some microcontrollers at the university, never had any experience with 32bit assembly or MMX/SSE.
neuviemeporte
See updated answer as it contains some links for reading...
renick
+2  A: 

Can you check the assembler code this loop generates? If you are only getting 25% processor use, this loop may be memory bound. There are about eight local variables here, and I imagine the compiler is not mapping all of them to registers, so many memory operations are being performed on each iteration. One option would be to write the loop in assembler.

Why do you walk the matrix column by column? Matrices will be stored in memory row after row, so if you access a whole column in the inner loop, you are probably requesting more memory loads to your different memory levels (caches and so on).

rturrado
Um... the OP *is* actually working row by row...
Martin B
It's hard to know what he's doing, because 'x' runs to 'height' and 'y' runs to 'width', which seems fairly odd.
Will Dean
Aaaaand... even if the code is memory bound, that won't cause the CPU usage to go below 100% because the cache miss won't cause the process to block. The only scenario in which this could happen is if there were two virtual "hyperthreading" cores sharing the same physical core and there was a process that was maxing out the other virtual core. It seems as if the OP is saying the other cores are idle, so this shouldn't be happening here.
Martin B
@Will: Aaaah... good catch! In that case, he should definitely check that he's accessing memory sequentially... bad locality can cause an incredible hit to performance.
Martin B
With Boost matrices, it is possible to specify whether you want them stored as row-major or column-major matrices. If they are stored as row-major matrices, which is the default for C, the OP should definitely switch the order of the for loops. Looping in column-major order over a matrix stored in row-major order can reduce performance by much more than a factor of ten.
Alderath
I had an error in the code, which I didn't notice because the matrices are usually square, actually the width and height should be reversed, and I am working row by row. Edited to reflect.
neuviemeporte
+1  A: 

If I were in your shoes I would try to find out what exactly is causing the difference in performance between the old and the new code. Perhaps the boost matrices use some sort of caching or lazy/eager evaluation.

StackedCrooked
The code base has diverged a lot between the releases, the interfaces are entirely different, there's no easy way to relate the routines between the versions.
neuviemeporte
+2  A: 

Various thoughts:

  • You say that you're only managing to achieve a CPU load of about 25%. I can think of two reasons for this:
    1. You're swapping. What is the size of your matrices? Do they fit entirely in physical memory? Look at your application's memory usage and working set size.
    2. The rest of your application's code is blocking on I/O. Does the code that surrounds your core routine do any I/O? It could be blocking there for large stretches of time, but of course you're not seeing that using the "pause-when-busy-and-inspect" technique because whenever the process unblocks again, it returns straight into your compute-intensive core routine.
  • Take a look at the assembly code for your core routine. Does it look reasonable?
  • Do you actually need to compute diffsum within the loop? It looks as if you could do diffsum=sumA-sumB once outside the loop -- but there may be numerical considerations that prevent you from doing this.
  • As renick has already commented, this looks like a prime target for SSE optimization. Again, you should make sure the compiler is generating reasonable assembly code (if you're using intrinsics and not writing the assembly yourself).
  • If you don't want to write SSE code yourself, at least make sure that your compiler's SSE flag is set. This will allow the compiler to use the SSE unit instead of the FPU for scalar floating-point operations, which will by itself improve performance, because the stack-based FPU on the x86 is notoriously ill-suited to compiler code generation.
Martin B
The matrices are usually sized 31x31, but that can change; usually no more than 63x63, though. They are generated (cut out or interpolated) from larger matrices, which are actually 1Mpix 8bpp images.
neuviemeporte
@neuviemeporte: OK, then it doesn't sound as if swapping is an issue here...
Martin B
Also, the rest of the code doesn't do any I/O, the images are read into memory at the beginning and no user input is necessary during the calculations.
neuviemeporte
+1  A: 

You should also try multithreading the loop via a simple setup like OpenMP. The 25% CPU usage sounds like a quad core running a single worker thread.
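
A minimal sketch of what this could look like on the OP's edited loop, assuming your VS2008 edition has OpenMP support (enabled with /openmp):

double sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0;

// each thread accumulates into private copies; OpenMP sums them at the end
#pragma omp parallel for reduction(+:sumA,sumB,sumAsq,sumBsq,diffsum,diffsumsq)
for (int y = 0; y < height; ++y)
{
    const char *Aptr = (const char*)(AMatrix.imageData + AMatrix.widthStep * y);
    const char *Bptr = (const char*)(BMatrix.imageData + BMatrix.widthStep * y);
    for (int x = 0; x < width; ++x)
    {
        int a = Aptr[x], b = Bptr[x];
        sumA += a;
        sumB += b;
        sumAsq += a * a;
        sumBsq += b * b;
        diffsum += a - b;
        diffsumsq += (a - b) * (a - b);
    }
}

Bear in mind that with matrices as small as 31x31, the per-call threading overhead may exceed the work itself; parallelizing the outer loop over matrix pairs would likely pay off more than parallelizing the rows of a single pair.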

rubenvb
A: 

Store matrices with the same parameters outside of the loop? I think that ought to save you some time.

Nick
A: 

"25% of its limit, and with relatively large oscillations, where I would expect steady, 100% load until the program terminates."

You mentioned that the function is surrounded by a whole bunch of conditions and control branches, so I guess that causes the CPU pipeline to be flushed instead of being utilized efficiently. Try to rewrite your software so that it doesn't require a lot of branching.

I would also recommend using one of the math libraries, such as Eigen, ATLAS, or GSL.

doc
A pipeline flush won't cause the process to block, so you should still be seeing 100% CPU usage -- it just won't be using those cycles as effectively as it could be.
Martin B
+1  A: 

You should try to get rid of your loops and vectorize the operations instead. Using a library like Eigen, your code would look something like this:

#include <Eigen/Core>

Eigen::MatrixXd matrixA(height, width);
Eigen::MatrixXd matrixB(height, width);

// whole-matrix reductions replace the hand-written loops
double sumA = matrixA.sum();
double sumAsq = matrixA.cwise().square().sum();
double sumB = matrixB.sum();
double sumBsq = matrixB.cwise().square().sum();

// element-wise difference, then the same reductions on it
Eigen::MatrixXd diff = matrixA - matrixB;
double diffSum = diff.sum();
double diffSumSq = diff.cwise().square().sum();

return sumA + sumB / sumAsq + sumBsq * diffSum * diffSumSq;
Kristian
+1  A: 

If you use the "pause" technique, it should tell you more than just that you're in that loop. It should tell you where in the loop.

Never guess when you can just find out. That said, here's my guess :-) You are doing all the summing in floating-point variables, but getting the original numbers as integer characters, right? Then you can expect the conversion from int to double to take some time, and if so you will see your pauses happening in those instructions a good part of the time. So basically I'm wondering why you don't do it all in integer arithmetic.
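
A minimal sketch of the all-integer variant of the OP's edited loop (for the sizes mentioned elsewhere in the thread, up to 63x63 matrices of 8-bit values, the worst-case sums fit comfortably in a 32-bit int):

int sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0;
for (int y = 0; y < height; ++y)
{
    const char *Aptr = (const char*)(AMatrix.imageData + AMatrix.widthStep * y);
    const char *Bptr = (const char*)(BMatrix.imageData + BMatrix.widthStep * y);
    for (int x = 0; x < width; ++x)
    {
        int a = Aptr[x], b = Bptr[x];
        sumA += a;
        sumAsq += a * a;
        sumB += b;
        sumBsq += b * b;
        int d = a - b;
        diffsum += d;
        diffsumsq += d * d;
    }
}
// convert to floating point once, at the very end
double result = sumA + (double)sumB / sumAsq + (double)sumBsq * diffsum * diffsumsq;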

You say the utilization never goes over 25%. Could that be because it's only using one of the 4 cores?

You say the utilization often drops below 25%. That suggests maybe the thread is blocking to do file I/O. If so, your pauses should catch it in the act and confirm it. If so, you might be able to speed up the I/O by using larger blocks, or possibly doing open/close less often. Note that improvements to your inner loop will shrink the amount of time spent in that loop, but it will not shrink the time spent in I/O, so the percent of time in I/O will increase (causing an apparent decrease in utilization), until you also shrink that.

Utilization is actually not a very useful number. It's only really an indicator of the CPU/IO split, and doesn't tell you at all if you are doing too much of either of those.

As @renick said, get rid of the address calculations. You should be able to step through this loop at the assembly-language level and see it doing nothing more than you would do if you put on your "guru" hat and wrote the assembly yourself.

In any case, vectorizing could be a big win.

Mike Dunlavey
Changing the summing from doubles to ints cut the mean execution time of this function from 20.5µs to 5.91µs. Some other changes brought me to 4.83µs. Surprisingly, changing the code to a single 1D loop brought a penalty rather than an improvement. I plan on posting a detailed answer as soon as I try everything suggested in here plus some of my own ideas.
neuviemeporte
A: 

TLDR: first, optimize the matrix multiplication algorithm, then watch your number of temporaries, then optimize your matrix internal allocations.

Long answer:

I think the most important thing to address is the optimization of your matrix multiplication. The most intuitive multiplication algorithm is O(n^3), which is costly even for small matrices.

To give you an example, for 2x2 matrix multiplication you have 8 multiplication operations ("mo"). For 3x3 matrix multiplication you have 27 mo, and for 4x4 you have 64 mo.

I'm not sure how it is implemented in your case, but if it's the intuitive algorithm (as a triple for loop), changing that to matrix multiplication using LU decomposed matrices should increase your performance drastically.

This is because once you have the decomposed matrices you can optimize the multiplication algorithm heavily (no sense multiplying rows and columns for the zero elements).

Further, consider using cached values instead of repeating the operations for adding to diffsumsq:

old code:

diffsum+=valA-valB; // compute difference once
diffsumsq+=(valA-valB)*(valA-valB); // compute it two more times, then multiply

new code:

diff = valA-valB; // compute difference once
diffsum+= diff; // use computed difference
diffsumsq+=diff*diff; // just multiply

The second variant is three times faster on the difference computation: across the two for loops, the x*y subtractions are executed only once instead of three times.

You can further watch for your number of temporary objects: each binary operation creates a temporary (which means allocating another x*y matrix in memory and copying values). For example the expression:

diffsumsq+=(valA-valB)*(valA-valB);

creates a temporary for the difference in the first parentheses, then another one for the difference in the second, then another one for the product.

In my example above, you're better writing

diff = valA;
diff -= valB;

instead of

diff = valA - valB;

as this way you avoid allocating the temporary that is assigned to diff (in the second variant).

Another thing to watch would be your memory consumption: since this gets executed a lot, you could either pre-allocate memory, or pool the used matrices and reuse memory instead of creating new objects.

I'm sure there are other things to watch for.

Edit: how are you multiplying the matrices? Their dimensions should match columns to rows; that is, the number of columns in valA should equal the number of rows in valB (if I remember my matrix multiplication correctly).

One more thing:

I made the calculation routine a multi-line #define macro to enforce its inlining and to avoid function calls and returns.

You don't need macros to optimize C++ code. To avoid function calls and returns, use inline functions. Macros come with their own problems.
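
For example, a rough sketch of the OP's macro rewritten as an inline function (the name and signature are made up for illustration):

// hypothetical replacement for the multi-line #define
inline void accumulate(int valA, int valB,
                       double &sumA, double &sumAsq,
                       double &sumB, double &sumBsq,
                       double &diffsum, double &diffsumsq)
{
    sumA += valA;
    sumAsq += valA * valA;
    sumB += valB;
    sumBsq += valB * valB;
    int diff = valA - valB;
    diffsum += diff;
    diffsumsq += diff * diff;
}

Any decent compiler will inline this in a Release build, and unlike a macro it is type-checked, debuggable, and evaluates its arguments exactly once.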

utnapistim
The OP isn't actually doing any matrix multiplication... all of the `*`s are scalar operations.
Martin B
Oh ... err ... my bad :)
utnapistim
chuckle .. interesting point, though.
Mike Dunlavey
That's right, I don't need matrix multiplication, just multiplying to get squares of differences of matrix elements. I just mentioned using matrix multiplication as a test for comparing performance of Boost vs. OpenCV.
neuviemeporte