I am developing some scientific software for my university. It is being written in C++ on Windows (VS2008). The algorithm must calculate some values for a large number of matrix pairs; that is, at its core resides a loop iterating over the matrices, collecting some data, e.g.:

double valA, valB, sumA = 0, sumAsq = 0, sumB = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0, result = 0;
for (int y=0; y < height; ++y)
{
    for (int x=0; x < width; ++x)
    { 
        valA = matrixA(x,y);
        valB = matrixB(x,y);
        sumA+=valA;
        sumAsq+=valA*valA;
        sumB+=valB;
        sumBsq+=valB*valB;
        diffsum+=valA-valB;
        diffsumsq+=(valA-valB)*(valA-valB);
    }
}
result = sumA + sumB / sumAsq + sumBsq * diffsum * diffsumsq; // value returned by the routine

This routine is executed millions of times for different matrixA, matrixB pairs. My problem is that this program is extremely slow, compiled in Release mode with all optimizations activated. Using the "pause-when-busy-and-inspect" debugger technique, I established that the program sits inside this loop virtually every time, even though, as you might expect, this routine is surrounded by a whole bunch of conditions and control branches. What puzzles me the most is that during its execution on a dual-processor Xeon-based system, the program utilizes one of the 4 cores (no surprise, it is single-threaded for now) but only up to about 25% of its limit, and with relatively large oscillations, where I would expect steady, 100% load until the program terminates.

The current version is actually a rewrite, created with performance optimization in mind. I was devastated to find out that it's actually slower than the original. The previous version used Boost matrices, which I replaced with OpenCV matrices after establishing OpenCV to be over 10 times faster when comparing the execution time of multiplying two 1000x100 matrices. I access the matrix data by manually dereferencing a raw pointer, which I hoped would gain me some performance. I made the calculation routine a multi-line #define macro to enforce its inlining and to avoid function calls and returns. I improved the math behind the calculations so that the final value is calculated in a single pass through the matrices (the old version required two passes). I expected huge gains, and yet the opposite is true. I'm nowhere near my old program's efficiency, not to mention that of commercial software for this particular application.

I was wondering if it perhaps has something to do with the matrix data being 8-bit chars; I once saw that access to floats was actually slower than access to doubles in my old program, and perhaps chars are even slower, since the processor retrieves data in 32-bit chunks (this Xeon probably grabs even 64 bits). I also considered turning the matrices into vectors to avoid the loop-inside-loop construct, as well as some form of vectorization, for example calculating the data for 4 (fewer? more?) consecutive matrix cells in a single loop iteration. Any other ideas, please?

EDIT: actual code in the new, OpenCV based version:

const char *Aptr, *Bptr;
double sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0;
char Aval, Bval;

for (int y=0; y < height; ++y)
{
    Aptr = (char*)(AMatrix.imageData + AMatrix.widthStep * y);
    Bptr = (char*)(BMatrix.imageData + BMatrix.widthStep * y);
    for (int x=0; x < width; ++x)
    {
        Aval = Aptr[x];
        Bval = Bptr[x];

        sumA+=Aval;
        sumB+=Bval;
        sumAsq+=Aval*Aval;
        sumBsq+=Bval*Bval;
        diffsum+=Aval-Bval;
        diffsumsq+=(Aval-Bval)*(Aval-Bval);
    }
}
+2  A: 

Your inner loop is calling functions! No matter how trivial they are, you pay a heavy penalty. You should try to linearize the matrix accesses (in essence, make them 1D) so that you can access them with just pointer dereferencing:

vala = *matrixA++; 
valb = *matrixB++; 
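
For instance, a minimal sketch of the fully linearized loop (the data pointer names are hypothetical, and it assumes the matrix data is contiguous with no row padding, which OpenCV's widthStep layout does not guarantee):

const char *pA = matrixAdata; // hypothetical pointers to contiguous matrix data
const char *pB = matrixBdata;
for (int i = 0; i < width * height; ++i)
{
    int valA = *pA++;
    int valB = *pB++;
    sumA += valA;
    sumAsq += valA * valA;
    sumB += valB;
    sumBsq += valB * valB;
    int diff = valA - valB;
    diffsum += diff;
    diffsumsq += diff * diff;
}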

Since you are doing simple additions and subtractions, look at SSE/SSE2 etc., depending on your target CPU's capabilities and your arithmetic (integer, floating point, etc.).

EDIT: MMX/SSE2 intrinsics are functions that map one-to-one to CPU SIMD instructions. See these Microsoft pages to get started; additionally, I suggest looking at the Intel site for the IA-32/Intel 64 programmer's guides, or similar manuals from AMD.
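
To illustrate the one-to-one mapping, here is a minimal sketch (not the OP's routine) that sums an array of unsigned bytes with SSE2, assuming n is a multiple of 16:

#include <emmintrin.h> // SSE2 intrinsics

unsigned int sum_bytes_sse2(const unsigned char *p, int n)
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = _mm_setzero_si128();
    for (int i = 0; i < n; i += 16)
    {
        // load 16 bytes; PSADBW against zero yields their sum in two 64-bit lanes
        __m128i v = _mm_loadu_si128((const __m128i*)(p + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    // combine the two 64-bit partial sums
    return (unsigned int)(_mm_cvtsi128_si32(acc) +
                          _mm_cvtsi128_si32(_mm_srli_si128(acc, 8)));
}

Each intrinsic compiles to essentially a single instruction (MOVDQU, PSADBW, PADDQ), so you get the speed of assembly while the compiler handles register allocation.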

I also highly recommend this book on Optimization for Intel Architectures. It will explain all the hidden capabilities of your CPU and compiler.

renick
They will very likely be inlined...
doc
Inlining doesn't matter if they do pointer arithmetic in every call (or heaven forbid bounds checking). The linearization is the important hint here
renick
As I mentioned, in the new version, the matrix data is accessed by a pointer, i.e. I have char * valAptr = (char*)(matrixAdata + matrixAstep * y); in the outside loop and valA = valAptr[x]; in the inside one. I don't quite get the bit about SSE, I activated "Enable Enhanced Instruction Set" and set it to /arch:SSE2 in the solution options, will this do?
neuviemeporte
@neuviemeporte: This will at least allow the compiler to use the SSE unit instead of the FPU (see my answer for details), but the compiler typically won't be clever enough to auto-vectorize your code. If you can vectorize your code using SSE intrinsics, there's a lot of potential for performance improvement.
Martin B
That's what I was saying about pointer arithmetic: get rid of the multiplications! You can do this by allocating the Matrix(NxM) as a vector(N*M). If you need to use it as a matrix, you can cast it as such, but in this loop use it as a vector. This will remove one loop. Then look at SSE intrinsics to do more in every cycle.
renick
How do I go about using SSE? I did some assembler programming, but mostly on DOS many years ago, also some microcontrollers at the university, never had any experience with 32bit assembly or MMX/SSE.
neuviemeporte
See updated answer as it contains some links for reading...
renick
+2  A: 

Can you check the assembler code this loop generates? If you are only getting 25% processor use, this loop may be memory bound. There are about eight local variables here, and I imagine the compiler is not mapping all of them to registers, so many memory operations are being performed on each iteration. One option would be to write the loop in assembler.

Why do you walk the matrix column by column? Matrices will be stored in memory row after row, so if you access a whole column in the inner loop, you are probably requesting more memory loads to your different memory levels (caches and so on).

rturrado
Um... the OP *is* actually working row by row...
Martin B
It's hard to know what he's doing, because 'x' runs to 'height' and 'y' runs to 'width', which seems fairly odd.
Will Dean
Aaaaand... even if the code is memory bound, that won't cause the CPU usage to go below 100% because the cache miss won't cause the process to block. The only scenario in which this could happen is if there were two virtual "hyperthreading" cores sharing the same physical core and there was a process that was maxing out the other virtual core. It seems as if the OP is saying the other cores are idle, so this shouldn't be happening here.
Martin B
@Will: Aaaah... good catch! In that case, he should definitely check that he's accessing memory sequentially... bad locality can cause an incredible hit to performance.
Martin B
With Boost matrices, it is possible to specify whether you want them stored as row-major or column-major matrices. If they are stored as row-major matrices, which is the default for C, the OP should definitely switch the order of the for loops. Looping in column-major order over a matrix stored in row-major order can reduce performance by much more than a factor of ten.
Alderath
I had an error in the code, which I didn't notice because the matrices are usually square, actually the width and height should be reversed, and I am working row by row. Edited to reflect.
neuviemeporte
+1  A: 

If I were in your shoes I would try to find out what exactly is causing the difference in performance between the old and the new code. Perhaps the boost matrices use some sort of caching or lazy/eager evaluation.

StackedCrooked
The code base has diverged a lot between the releases, the interfaces are entirely different, there's no easy way to relate the routines between the versions.
neuviemeporte
+2  A: 

Various thoughts:

  • You say that you're only managing to achieve a CPU load of about 25%. I can think of two reasons for this:
    1. You're swapping. What is the size of your matrices? Do they fit entirely in physical memory? Look at your application's memory usage and working set size.
    2. The rest of your application's code is blocking on I/O. Does the code that surrounds your core routine do any I/O? It could be blocking there for large stretches of time, but of course you're not seeing that using the "pause-when-busy-and-inspect" technique because whenever the process unblocks again, it returns straight into your compute-intensive core routine.
  • Take a look at the assembly code for your core routine. Does it look reasonable?
  • Do you actually need to compute diffsum within the loop? It looks as if you could do diffsum=sumA-sumB once outside the loop -- but there may be numerical considerations that prevent you from doing this.
  • As renick has already commented, this looks like a prime target for SSE optimization. Again, you should make sure the compiler is generating reasonable assembly code (if you're using intrinsics and not writing the assembly yourself).
  • If you don't want to write SSE code yourself, at least make sure that your compiler's SSE flag is set. This will allow the compiler to use the SSE unit instead of the FPU for scalar floating-point operations, which will by itself improve performance, because the stack-based FPU on the x86 is notoriously ill-suited to compiler code generation.
Martin B
The matrices are usually sized 31x31, but that can change; usually no more than 63x63, though. They are generated (cut out or interpolated) from larger matrices, which are actually 1Mpix 8bpp images.
neuviemeporte
@neuviemeporte: OK, then it doesn't sound as if swapping is an issue here...
Martin B
Also, the rest of the code doesn't do any I/O, the images are read into memory at the beginning and no user input is necessary during the calculations.
neuviemeporte
+1  A: 

You should also try multithreading the loop via a simple setup like OpenMP. The 25% CPU usage sounds like a quad core running a single worker thread.
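
A minimal sketch of what this could look like on the OP's edited loop, assuming your VS2008 edition has OpenMP support (enabled with /openmp):

double sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0;

// each thread accumulates into private copies; OpenMP sums them at the end
#pragma omp parallel for reduction(+:sumA,sumB,sumAsq,sumBsq,diffsum,diffsumsq)
for (int y = 0; y < height; ++y)
{
    const char *Aptr = (const char*)(AMatrix.imageData + AMatrix.widthStep * y);
    const char *Bptr = (const char*)(BMatrix.imageData + BMatrix.widthStep * y);
    for (int x = 0; x < width; ++x)
    {
        int a = Aptr[x], b = Bptr[x];
        sumA += a;
        sumB += b;
        sumAsq += a * a;
        sumBsq += b * b;
        diffsum += a - b;
        diffsumsq += (a - b) * (a - b);
    }
}

Bear in mind that with matrices as small as 31x31, the per-call threading overhead may exceed the work itself; parallelizing the outer loop over matrix pairs would likely pay off more than parallelizing the rows of a single pair.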

rubenvb
A: 

Store matrices with the same parameters outside of the loop? I think that ought to save you some time.

Nick
A: 

"25% of its limit, and with relatively large oscillations, where I would expect steady, 100% load until the program terminates."

You mentioned that the function is surrounded by a whole bunch of conditions and control branches, so I guess that causes the CPU pipeline to be flushed instead of being utilized efficiently. Try to rewrite your software so that it doesn't require a lot of branching.

I would also recommend using one of the math libraries, such as Eigen, ATLAS, or GSL.

doc
A pipeline flush won't cause the process to block, so you should still be seeing 100% CPU usage -- it just won't be using those cycles as effectively as it could be.
Martin B
+1  A: 

You should try to get rid of your loops and vectorize the operations instead. Using a library like Eigen, your code would look something like this:

#include <Eigen/Core>

Eigen::MatrixXd matrixA(height, width);
Eigen::MatrixXd matrixB(height, width);

// whole-matrix reductions replace the hand-written loops
double sumA = matrixA.sum();
double sumAsq = matrixA.cwise().square().sum();
double sumB = matrixB.sum();
double sumBsq = matrixB.cwise().square().sum();

// element-wise difference, then the same reductions on it
Eigen::MatrixXd diff = matrixA - matrixB;
double diffSum = diff.sum();
double diffSumSq = diff.cwise().square().sum();

return sumA + sumB / sumAsq + sumBsq * diffSum * diffSumSq;
Kristian
+1  A: 

If you use the "pause" technique, it should tell you more than just that you're in that loop. It should tell you where in the loop.

Never guess when you can just find out. That said, here's my guess :-) You are doing all the summing in floating-point variables, but getting the original numbers as integer characters, right? Then you can expect the conversion from int to double to take some time, and if so you will see your pauses happening in those instructions a good part of the time. So basically I'm wondering why you don't do it all in integer arithmetic.
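
A minimal sketch of the all-integer variant of the OP's edited loop (for the sizes mentioned elsewhere in the thread, up to 63x63 matrices of 8-bit values, the worst-case sums fit comfortably in a 32-bit int):

int sumA = 0, sumB = 0, sumAsq = 0, sumBsq = 0, diffsum = 0, diffsumsq = 0;
for (int y = 0; y < height; ++y)
{
    const char *Aptr = (const char*)(AMatrix.imageData + AMatrix.widthStep * y);
    const char *Bptr = (const char*)(BMatrix.imageData + BMatrix.widthStep * y);
    for (int x = 0; x < width; ++x)
    {
        int a = Aptr[x], b = Bptr[x];
        sumA += a;
        sumAsq += a * a;
        sumB += b;
        sumBsq += b * b;
        int d = a - b;
        diffsum += d;
        diffsumsq += d * d;
    }
}
// convert to floating point once, at the very end
double result = sumA + (double)sumB / sumAsq + (double)sumBsq * diffsum * diffsumsq;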

You say the utilization never goes over 25%. Could that be because it's only using one of the 4 cores?

You say the utilization often drops below 25%. That suggests maybe the thread is blocking to do file I/O. If so, your pauses should catch it in the act and confirm it. If so, you might be able to speed up the I/O by using larger blocks, or possibly doing open/close less often. Note that improvements to your inner loop will shrink the amount of time spent in that loop, but it will not shrink the time spent in I/O, so the percent of time in I/O will increase (causing an apparent decrease in utilization), until you also shrink that.

Utilization is actually not a very useful number. It's only really an indicator of the CPU/IO split, and doesn't tell you at all if you are doing too much of either of those.

As @renick said, get rid of the address calculations. You should be able to step through this loop at the assembly-language level and see it doing nothing more than you would do if you put on your "guru" hat and wrote the assembly yourself.

In any case, vectorizing could be a big win.

Mike Dunlavey
Changing the summing from doubles to ints cut the mean execution time of this function from 20.5µs to 5.91µs. Some other changes brought me to 4.83µs. Surprisingly, changing the code to a single 1D loop brought a penalty rather than an improvement. I plan on posting a detailed answer as soon as I try everything suggested in here plus some of my own ideas.
neuviemeporte
A: 

TLDR: first, optimize the matrix multiplication algorithm, then watch your number of temporaries, then optimize your matrix internal allocations.

Long answer:

I think the most important thing to address is the optimization of your matrix multiplication. The most intuitive multiplication algorithm is O(n^3), which is costly even for small matrices.

To give you an example, for 2x2 matrix multiplication you have 8 multiplication operations ("mo"). For 3x3 matrix multiplication you have 27 mo, and for 4x4 you have 64 mo.

I'm not sure how it is implemented in your case, but if it's the intuitive algorithm (as a triple for loop), changing that to matrix multiplication using LU decomposed matrices should increase your performance drastically.

This is because once you have the decomposed matrices you can optimize the multiplication algorithm heavily (no sense multiplying rows and columns for the zero elements).

Further, consider using cached values instead of repeating the operations for adding to diffsumsq:

old code:

diffsum+=valA-valB; // compute difference once
diffsumsq+=(valA-valB)*(valA-valB); // compute it two more times, then multiply

new code:

diff = valA-valB; // compute difference once
diffsum+= diff; // use computed difference
diffsumsq+=diff*diff; // just multiply

The second variant is three times faster on the difference computation: across the two for loops, the x*y subtractions are executed only once instead of three times.

You can further watch for your number of temporary objects: each binary operation creates a temporary (which means allocating another x*y matrix in memory and copying values). For example the expression:

diffsumsq+=(valA-valB)*(valA-valB);

creates a temporary for the difference in the first parentheses, then another one for the difference in the second, then another one for the product.

In my example above, you're better writing

diff = valA;
diff -= valB;

instead of

diff = valA - valB;

as this way you avoid allocating the temporary that is assigned to diff (in the second variant).

Another thing to watch would be your memory consumption: since this gets executed a lot, you could either pre-allocate memory, or pool the used matrices and reuse memory instead of creating new objects.

I'm sure there are other things to watch for.

Edit: how are you multiplying the matrices? Their dimensions should match columns to rows; that is, the number of columns in valA should equal the number of rows in valB (if I remember my matrix multiplication correctly).

One more thing:

I made the calculation routine a multi-line #define macro to enforce its inlining and to avoid function calls and returns.

You don't need macros to optimize C++ code. To avoid function calls and returns, use inline functions. Macros come with their own problems.
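
For example, a rough sketch of the OP's macro rewritten as an inline function (the name and signature are made up for illustration):

// hypothetical replacement for the multi-line #define
inline void accumulate(int valA, int valB,
                       double &sumA, double &sumAsq,
                       double &sumB, double &sumBsq,
                       double &diffsum, double &diffsumsq)
{
    sumA += valA;
    sumAsq += valA * valA;
    sumB += valB;
    sumBsq += valB * valB;
    int diff = valA - valB;
    diffsum += diff;
    diffsumsq += diff * diff;
}

Any decent compiler will inline this in a Release build, and unlike a macro it is type-checked, debuggable, and evaluates its arguments exactly once.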

utnapistim
The OP isn't actually doing any matrix multiplication... all of the `*`s are scalar operations.
Martin B
Oh ... err ... my bad :)
utnapistim
chuckle .. interesting point, though.
Mike Dunlavey
That's right, I don't need matrix multiplication, just multiplying to get squares of differences of matrix elements. I just mentioned using matrix multiplication as a test for comparing performance of Boost vs. OpenCV.
neuviemeporte