views:

418

answers:

6

Hello there,

I had a routine that was performing well. However, I had to make a change to it. The change improved the routine's precision but hurt its performance.

The routine consists mostly of math calculations and is probably CPU-bound (I still have to do more rigorous testing on this, but I'm 99% sure). It is written in C++ (the compiler is Borland C++ 6).

I want to measure the performance of the routine now. At first I thought about measuring the execution time, but that approach seems flawed to me, since there could be much more going on.

I then ran into this topic: Techniques to measure application performance - Stack Overflow. I liked the idea of measuring in MFLOPS.

My boss suggested trying some kind of measurement in CPU clock cycles, so that the tests would be machine-independent; however, I think this approach amounts to much the same thing as MFLOPS testing.

In my opinion, measuring both things (execution time and MFLOPS) is the way to go, but I would like to hear what the Stack Overflow experts think.

What is the way to go to measure the performance of a routine that is known to be CPU-bound?

+2  A: 

I agree with your boss - measure in terms of CPU clock cycles. Be aware, though, that there could be other things going on, such as a lot of cache misses slowing your code down. If you can, use VTune or one of the free tools from Intel to pinpoint the nature of the bottleneck.

Stephen Doyle
CPU clock cycles are not necessarily meaningful.
David Thornley
Yep, David - that's what I meant by other things going on, such as cache misses! As with any measurement, a single number is rarely meaningful and needs to be put into context.
Stephen Doyle
+4  A: 

CPU clock cycles don't mean that much either if your application is memory-bound. On a faster CPU, you'll just spend more CPU cycles waiting on the same cache miss. (Mathematical apps are probably not I/O-bound.)

Another problem is that the number of clock cycles for a given instruction sequence will still vary across architectures (even between Intel Core and Core 2). So, as an absolute measure of performance, clock cycles on one CPU are hardly an improvement.

I would argue they're in fact worse as a measure. Unlike time, users don't care about cycles. This matters especially with modern multi-core CPUs. An "inefficient" algorithm using twice the number of cycles across 3 cores will finish in 67% of the time. Users will probably like that.

MSalters
+2  A: 

Your question implies that the software is already as fast as it could possibly go, except for the precision issue. I have found that this is often not the case, and I'm assuming that what you really want is to make it that fast.

I would suggest that measuring is missing the point.

What you really need to do is locate the statements or instructions (not functions) 1) that are responsible for a significant fraction of the wall-clock time, and 2) that you can find a way to optimize.

Assuming the software is of a non-trivial size, chances are it has at least a few layers of function calls, and it is quite possible that a few of these function calls (not functions, function calls) are responsible for significant time-fraction and could be optimized.

This is a very good way to locate them, and this is an example of its use.

Mike Dunlavey
I know I can count on a downvote if I swim against the tide by saying that _catching_ costly _instructions_ is more effective than _measuring_ the cost of _functions_, even with call-graphs.
Mike Dunlavey
+1  A: 

Measuring execution time is the way to go.

In this case, I think you want to minimize what you are measuring to reduce the number of variables.

Next, it would be a good idea to run a baseline of some sort to calibrate that particular machine. Either use the last checked-in version or some intensive routine that roughly matches the type of computation you are trying to measure. Then you can express the benchmark as:

relative_time = measured_time_for_routine / measured_time_for_baseline
Alan Jackson
+2  A: 

CPU clock cycles aren't machine-independent nowadays, even with CPUs that use the same instruction set. The x86 (or whatever) machine code will be sliced and diced in all sorts of different ways. The days when this meant anything are long gone (and, back when CPU cycles meant something, there were so many different CPU types in use that it was machine-dependent anyway).

Not to mention that CPU-bound isn't as clear as it used to be, what with cache misses and all. It used to be that a CPU-bound process was one limited by the CPU rather than by I/O and such, since a memory access took a fixed number of CPU cycles.

What you're trying to measure is performance, which I take to mean how fast it runs. In that case, you're probably best off measuring wall-clock time, repeating the calculation enough times that you get significant results. You could create a testing harness that would run through different implementations, so you'd get comparable results.

David Thornley
+1  A: 

You can measure in terms of CPU hardware counters; Intel's VTune profiler is pretty good at this. It will show you detailed information based on the CPU counters (Instructions Retired, Cache Misses, Branch Misprediction), and it will correlate this with each statement in your functions, so you will have a pretty good idea of what is costing the most.

This is assuming that your function is not memory-bound.

Thanks

mfawzymkh