The first thing I tell people is to recognize the difference between
1) timing routines and counting how many times they are called, and
2) finding code that you can fruitfully optimize.
For (1) there are instrumenting profilers.
To be really successful at (2) you need a rare type of profiler.
You need a sampling profiler that
samples the entire call stack, not just the program counter
samples at random wall clock times, not just CPU, so as to capture possible I/O problems
samples when you want it to (not when waiting for user input)
for output, gives you, for each line of code that appears on stack samples, the percent of samples containing that line. That is a direct measure of the total time that could be saved if that line were not there.
(I actually do it by hand, interrupting the program under the debugger.)
Don't get sidetracked by problems you don't have, such as
accuracy of measurement. If a line of code appears on 30% of call stack samples, it's actual cost could be anywhere in a range around 30%. If you can find a way to eliminate it or invoke it a lot less, you will save what it costs, even if you don't know in advance exactly what its cost is.
efficiency of sampling. Since you don't need accuracy of time measurement, you don't need a large number of samples. Even if you get a large number of samples, they don't skew the results significantly, because they don't fail to spot the costly lines of code.
call graphs. They make nice graphics, but are not what you need to know. An arc on a call graph corresponds to a line of code in the best case, usually multiple lines, so knowing cost of an arc only tells the cost of a line in the best case. Call graphs concentrate on functions, when what you need to find is lines of code. Call graphs get wrapped up in the issue of recursion, which is irrelevant.
It's important to understand what to expect. Many programmers, using traditional profilers, can get a 20% improvement, consider that terrific, count the profiler a winner, and stop there. Others, working with large programs, can often get speedup factors of 20 times.
This is done by fixing a series of problems, each one giving a multiplicative speedup factor. As soon as the profiler fails to find the next problem, the process stops. That's why "good enough" isn't good enough.
Here is a brief explanation of the method.