views: 261

answers: 7

Yes, I am using a profiler (ANTS). But at the micro-level it cannot tell you how to fix your problem. And I'm at a microoptimization stage right now. For example, I was profiling this:

for (int x = 0; x < Width; x++)
{
    for (int y = 0; y < Height; y++)
    {
        packedCells.Add(Data[x, y].HasCar);
        packedCells.Add(Data[x, y].RoadState);
        packedCells.Add(Data[x, y].Population);
    }
}

ANTS showed that the y-loop-line was taking a lot of time. I thought it was because it has to constantly call the Height getter. So I created a local int height = Height; before the loops, and made the inner loop check for y < height. That actually made the performance worse! ANTS now told me the x-loop-line was a problem. Huh? That's supposed to be insignificant, it's the outer loop!

Eventually I had a revelation - maybe using a property for the outer-loop-bound and a local for the inner-loop-bound made CLR jump often between a "locals" cache and a "this-pointer" cache (I'm used to thinking in terms of CPU cache). So I made a local for Width as well, and that fixed it.

From there, it was clear that I should make a local for Data as well - even though Data was not even a property (it was a field). And indeed that bought me some more performance.
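The final shape of those changes was roughly this (a sketch reconstructed from the description above; the local names are illustrative, not from my actual code):

    // Hoist the properties and the field into locals before the loops.
    int width = Width;    // avoid re-reading the Width property each outer iteration
    int height = Height;  // avoid re-reading the Height property each inner iteration
    var data = Data;      // hoist the field too - one less indirection through 'this'

    for (int x = 0; x < width; x++)
    {
        for (int y = 0; y < height; y++)
        {
            packedCells.Add(data[x, y].HasCar);
            packedCells.Add(data[x, y].RoadState);
            packedCells.Add(data[x, y].Population);
        }
    }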

Bafflingly, though, reordering the x and y loops (to improve cache usage) made zero difference, even though the array is huge (3000x3000).

Now, I want to learn why the stuff I did improved the performance. What book do you suggest I read?

+8  A: 

CLR via C# by Jeffrey Richter.

It is such a great book that someone stole it from my library, together with C# in Depth.

Pierre 303
+1 - Someone always beats me to the punch whenever I try to post this particular answer :-)
InSane
@In Sane - I feel your pain. Scooped me too.
Justin
@In Sane and @Justin - ditto. I'll add that you _must_ read this book multiple times. It's like a great movie: there are gems of awesome that you only get after the 4th time reading it.
SnOrfus
That's true! C# in Depth is focused on the language, and CLR via C# on the CLR.
Pierre 303
Now that got me thinking... if I read this book while thinking about optimization only, I'll probably miss out on a lot. How do I avoid that? I mean, I'm not writing a CLR implementation, so can I get any value out of this book other than learning to optimize better?
Stefan Monov
Read it 5 times as suggested by SnOrfus
Pierre 303
+4  A: 

The CLR is not involved at all here, this should all be translated to straight machine code without calls into the CLR. The JIT compiler is responsible for generating that machine code, it has an optimizer that tries to come up with the most efficient code. It has limitations, it cannot spend a large amount of time on it.

One of the important things it does is figuring out what local variables should be stored in the CPU registers. That's something that changed when you put the Height property in a local variable. It possibly decided to store that variable in a register. But now there's one less available to store another variable. Like the x or y variable, one that's critical for speed. Yes, that will slow it down.

You got a bad diagnostic about the outer loop. That could possibly be caused by the JIT optimizer re-arranging the loop code, giving the profiler a harder time mapping the machine code back to the corresponding C# statement.

Similarly, the optimizer might have decided that you were using the array inefficiently and switched the indexing order back. Not so sure it actually does that, but not impossible.

Anyhoo, the only way you can get some insight here is by looking at the generated machine code. There are many decent books about x86 assembly code, although they might be a bit hard to find these days. Your starting point is Debug + Windows + Disassembly.

Keep in mind however that even the machine code is not a very good predictor of how efficient code is going to run. Modern CPU cores are enormously complicated and the machine code is no longer representative for what actually happens inside the core. The only tried and true way is what you've already been doing: trial and error.

Hans Passant
Beware using the debugger to view the assembly, because [using a debugger disables optimizations](http://stackoverflow.com/questions/3645333/how-do-i-learn-enough-about-clr-to-make-educated-guesses-about-performance-proble/3645496#3645496): am I right?
ChrisW
Thanks, that was some very useful info.
Stefan Monov
@ChrisW - turn off the "Suppress JIT optimization on module load" option.
Hans Passant
+3  A: 

Albin - no. Honestly I didn't think that running outside a profiler would change the performance difference, so I didn't bother. You think I should have? Has that been a problem for you before? (I am compiling with optimizations on though)

Running under a debugger changes the performance: when it's being run under a debugger, the just-in-time compiler automatically disables optimizations (to make it easier to debug)!

If you must, use the debugger to attach to an already-running already-JITted process.

ChrisW
Good to know! But I was talking about running under a *profiler* (not debugger).
Stefan Monov
@Stefan Monov - I don't know whether a profiler has the same problem. I hope not! I suppose theoretically it may vary/depend on the profiler implementation.
ChrisW
@Stefan Monov - Actually, if a profiler is giving you details about each line, then it's being quite intrusive, and I don't see how it could possibly avoid changing the caching characteristics of the algorithm. So final optimizations should perhaps be done without an intrusive profiler: measure the whole method (e.g. using the `Stopwatch` class) rather than individual lines.
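For example, a minimal sketch of that whole-method timing (`PackCells` is a hypothetical stand-in for the method containing the loops being measured):

    var sw = System.Diagnostics.Stopwatch.StartNew();
    const int runs = 10;               // repeat to average out noise
    for (int i = 0; i < runs; i++)
    {
        PackCells();                   // hypothetical: the code under test
    }
    sw.Stop();
    Console.WriteLine("Average: {0} ms per run",
        sw.ElapsedMilliseconds / (double)runs);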
ChrisW
Ok, I just realized that all of the optimizations described in my original question turn out to be PESSIMIZATIONS when running in release mode (I have always been using Debug mode so far!). Also release mode speeds up my app dramatically, and I hadn't even thought of using it. So... big thanks.
Stefan Monov
+1  A: 

Hm, I don't think that loop unrolling is the real problem. 1. I'd try to avoid accessing the array Data three times per inner-loop iteration. 2. I'd also recommend rethinking the three Add statements: you are apparently accessing a collection three times to add some trivial data. Make it only one access per iteration, and add a data type containing the three entries:

for (int y = 0; y < Height; y++)
{
    var tTemp = Data[x, y];
    packedCells.Add(new {
        tTemp.HasCar, tTemp.RoadState, tTemp.Population
    });
}

Another look reveals that you are basically vectorizing a matrix by copying it into an array (or some other sequential collection)... Is that necessary at all? Why not just define a specialized indexer which simulates that linear access? Even better, if you only need to enumerate the entries (in this example you do; no random access is required), why not use an adequate LINQ expression?

paul_71
The reason I'm vectorizing the matrix is because then I can use BinaryFormatter to serialize the vectorized representation extremely *fast* and receive a very *small size*. By contrast, when I serialize the matrix directly, speed is orders of magnitude worse, and size is 3 times larger. That's also the reason I'm adding each element separately - this way, my vectorized representation is a List<byte> rather than List<MyStruct>, so, again, it's serialized much faster and produces a much smaller file. The reason I don't use LINQ here is that it's, again, much slower. I have profiled all of this.
Stefan Monov
Hm, still I'd recommend a 'Vectorizer' adapter which simulates linear access to your matrix. You would present this adapter to the formatter for serializing, without ever creating a temporary List or running through the inefficient (it does nothing except copy) quadratic loop... With such an adapter the loop will be run only once, at the moment it is really needed - and no additional memory will be required.
paul_71
Then this Vectorizer class would need to implement ISerializable.GetObjectData, and what would I put in that method exactly? Note that I can't serialize an IEnumerable unless the concrete type implementing IEnumerable supports serialization.
Stefan Monov
++ Whenever I see things like *new* and *Add* my guesser gets kicked into gear. Nevertheless, samples would prove where the problem really is.
Mike Dunlavey
+2  A: 

One thing you should know about working with arrays is that the CLR will always make sure that array indices are not out of bounds. It has an optimization for 1-dimensional arrays, but not for arrays of 2+ dimensions.

Knowing this, you may want to benchmark MyCell[][] Data instead of MyCell[,] Data
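A sketch of what the jagged-array variant might look like (the allocation loop and the row-caching local are assumptions, not from the question):

    // Jagged array: each row is a 1-D array, which the JIT can bounds-check
    // more cheaply than a rectangular MyCell[,].
    MyCell[][] jagged = new MyCell[3000][];
    for (int x = 0; x < jagged.Length; x++)
        jagged[x] = new MyCell[3000];

    for (int x = 0; x < jagged.Length; x++)
    {
        MyCell[] row = jagged[x];  // one indirection per row
        // looping to row.Length lets the JIT eliminate the 1-D bounds check
        for (int y = 0; y < row.Length; y++)
        {
            packedCells.Add(row[y].HasCar);
        }
    }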

Henk Holterman
A: 

Since the comment might be overlooked, I'll repeat myself: it is quite cumbersome to optimize code which is superfluous per se. You do not really need to explicitly linearize your matrix at all; see the comment above: define a linearizing adapter which implements IEnumerable<MyCell> and feed it into the formatter.

I am getting a warning when I try to add another answer, so I am going to recycle this one.. :) After reading Stefan's comments and thinking about it for a while, I suggest the following: if serializing a multi-dimensional array is too slow (haven't tried, I just believe you...) don't use it at all! It appears that your matrix is not sparse and has fixed dimensions. So define the structure holding your cells as a simple linear array with an indexer:

[Serializable]
class CellMatrix
{
    Cell[] mCells;

    public int Rows { get; private set; }
    public int Columns { get; private set; }

    public Cell this[int i, int j]
    {
        get { return mCells[i + Rows * j]; }
        set { mCells[i + Rows * j] = value; }
    }

    // constructor taking rows/cols...
}

A thing like this should serialize as fast as a native array does... I don't recommend hard-coding the layout of Cell in order to save a few bytes there...
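For what it's worth, feeding such a class to the formatter might look like this (the constructor and file name are assumptions for illustration):

    var matrix = new CellMatrix(3000, 3000);  // hypothetical constructor
    var formatter =
        new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
    using (var stream = System.IO.File.Create("cells.bin"))
    {
        // the backing store is a 1-D Cell[], which BinaryFormatter
        // serializes much faster than a 2-D Cell[,]
        formatter.Serialize(stream, matrix);
    }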

Cheers, Paul

paul_71
+1  A: 

Point 1) Educated guesses are not the way to do performance tuning. In this case I can guess about as well as most, but guessing is the wrong way to do it.

Point 2) Profilers need to be well understood before you know what they're actually telling you. Here's a discussion of the issues. For example, what many profilers do is tell you "where the program spends its time", i.e. where the program counter spends its time, so they are almost absolutely blind to time requested by function calls, which is what your inner loop seems to consist of.

I do a lot of performance tuning, and here is what I do. I cycle between two activities:

  • Overall time measurement. This doesn't require special tools. I'm not trying to measure individual routines.

  • "Bottleneck" location. This does not require running the code at any kind of speed, because I'm not measuring. What I'm doing is locating lines of code that are responsible for a significant percent of time. I know which lines they are because they are on the stack for that percent, and stack samples easily find them.

Once I find a "bottleneck" and fix it, I go back to the first step, measure what percent of time I saved, and do it all again on the next "bottleneck", typically from 2 to 6 times. I am helped by the "magnification effect", in which a fixed problem magnifies the percentage used by remaining problems. It works for both macro and micro optimization.

(Sorry if I can't write "bottleneck" without quotes, because I don't think I've ever found a performance problem that resembled the neck of a bottle. Rather they were all simply doing things that didn't really need to be done.)

Mike Dunlavey
Sampling the stack does not tell you what line of code is running, just what the function call stack looks like. Sampling the CPU's program counter tells you what line of code is running.
Gabe
@Gabe: It absolutely does tell you both of those. Every line of code would, if removed or otherwise made to take no time, reduce overall time by a certain percent. During that percent, it is on the stack for you to see. That's why you take enough samples to see it. You don't need many.
Mike Dunlavey
Mike: The stack is a record of *where you've been*, not *where you are*. Since the stack only gets written when you make a function call (push) or return from a function (pop), there's no way for it to know what line of code is currently executing.
Gabe
@Gabe: It is a record of *why* the PC is where it is, and *why* it is spending the time it's spending, which allows you to tell if it really needs to be done or not. Check this link: http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343
Mike Dunlavey
@Gabe: Sorry. That's one of the ideas I encounter all the time - that any function call casually written, costing from 100s to billions of cycles, must have been unquestionably necessary, else why would it have been coded? That's like my saying "I'm sure my code has no bugs, because why would I put bugs in my code?" To me, performance problems are just another kind of bug, only far, far easier to find, because they expose themselves.
Mike Dunlavey
Mike: I'm confused because in the link you gave, you clearly point out the difference between PC sampling and stack sampling, even referring to the call stack as the "extended program counter". This seems to imply that you know that the PC isn't on the call stack, yet now you are arguing that the stack somehow tells you what line of code is executing.
Gabe
@Gabe: If you're in an IDE, the app's running, you hit "pause", and display the call stack, you see the line where the PC is currently "at" (unless it's in the system somewhere). Next to that you see the line where it was called from. If that's in readable code, you can see why the PC is where it is. Next to *that* you see the call that got it *there*, and so on. So yes, I consider the PC as part of the stack. Every one of the call sites on the stack sample is something you can look at to see if you can get rid of it, and that's just as good, maybe a better ...
Mike Dunlavey
@Gabe: ... way, to cause the PC not to spend the time where you find it. Of course, you take several samples, because if a call site is only on one sample, that doesn't tell much, but if it's on 2 or more samples, it could really save you something if you could find a way to remove it or call it a lot less. ALSO, suppose your app spends a good fraction of time in blocking system calls like I/O. Then your PC is somewhere meaningless, but the stack tells exactly why you're there, and the same optimization principles apply.
Mike Dunlavey
@Gabe: Maybe another way to answer is that *every* instruction is a call instruction (microcode), so the PC is just part of the call stack. The fact that you can't see the microcode part of the stack is beside the point, because the only code you can optimize is on the part of the stack you *can* see.
Mike Dunlavey
@Gabe: Of course if you *only* look at the PC, you miss all those juicy calls higher up that you could optimize away. Another way to say it is, the program traces out a call tree. PC sampling only shows you leaves, while stack sampling also shows you branches and trunks. Optimizing is pruning the tree, right? So wouldn't you rather find whole branches you could lop off, rather than just leaves?
Mike Dunlavey
I'm not trying to say how you should do performance optimizations, just that from a CPU's point of view, the stack and the PC are two very different things (even though you could consider them both part of the same abstract thing called the "Call Stack"). Obviously if you were capturing the stack, you would want to capture the PC at the same time, but having the ability to see the PC in no way implies that you can see the stack.
Gabe
@Gabe: I understand. I was just trying to answer the practical question.
Mike Dunlavey
@Mike: re "Educated guesses are not the way to do performance tuning." then how would you have found the bottleneck in my posted code? Is it the loop-counter-increment? The less-than operator? The property access? A CPU cache problem? Not enough space in registers? etc
Stefan Monov
@Stefan: If the things you're mentioning were not guesses, you would know the answer, so if you (or anyone else) doesn't know the answer, then they are guesses. I can guess too, but guesses don't work. The way to answer the question is to *find out*, not guess. I can't be any more explicit about how to do that than I have been above and elsewhere. If you're not used to doing that, I'm just recommending that you do. You'll never need to guess again: http://stackoverflow.com/questions/2624667/whats-a-very-easy-c-profiler-vc/2624725#2624725
Mike Dunlavey
@Stefan: As an aside, I'm a student pilot, and every time there's a plane crash in the news, I like to wonder what went wrong, and people love to ask me. I can read about the circumstances and what the news reports say, and come up with juicy theories, mostly saying how the people involved screwed up. But months later when the official accident investigation comes out, the *real* reason turns out to be, most of the time, either a whole lot more than I guessed or something else entirely. When I "profile" the same thing happens.
Mike Dunlavey