Years ago I built a profiler and described it on SO, in an answer to another question (about how to build a profiler) that I can't locate right now.
It at least partially automates the technique I have used for decades, an example of which is given here. It is based on stack sampling, and the key is in how that information is presented and in the thought process the user goes through.
I suggest that the general beliefs about performance tuning, as taught in school (by professors with little exposure to real-world software) and perpetuated via the 50,000-programmers-can't-be-wrong phenomenon, need to be questioned and put on a firmer footing. I am far from alone in feeling this way, as you might gather from cruising around SO.
I think profiler technology is gradually evolving (for the better, in my opinion) toward stack sampling and ways to explore the results. Here are the insights I depend on (which you might find a little jarring):
Uncovering performance problems so they can be fixed, and measuring performance, are two entirely different tasks. They are means and ends, and should not be confused.
To uncover performance problems, what is needed is to find activities that account for large amounts of wall-clock time and that can be replaced with something faster.
The nice thing about such activities is that the very fact they take time exposes them to random-time samples of the program's state.
Not many samples are needed, provided they are taken during the interval you care about; there's no point in taking samples while waiting for user input, for example. To that end, in my profiler, I let the user trigger the samples with keys.
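To illustrate the idea, here is a minimal sketch in Python (purely illustrative; the slow_work function and the Ctrl-\ / SIGQUIT trigger are stand-ins, not how my profiler worked). A keypress-triggered sample can be as simple as a signal handler that prints the current stack:

```python
# Sketch only: dump a stack sample each time the user presses Ctrl-\
# (which sends SIGQUIT on Unix). slow_work is a stand-in workload.
import signal
import time
import traceback

def take_sample(signum, frame):
    # Print the full call stack as it was at the moment of the keypress.
    print("---- stack sample ----")
    traceback.print_stack(frame)

signal.signal(signal.SIGQUIT, take_sample)

def slow_work():
    return sum(i * i for i in range(10**7))

while True:
    slow_work()      # press Ctrl-\ here to catch it "in the act"
    time.sleep(0.5)  # idle time: no point sampling while waiting
```

The point of the manual trigger is that the human, not a timer, decides which interval is worth sampling.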
The reason you don't need many samples is this. Any given performance problem costs some fraction X of the wall-clock time in the interval of interest. A random sample in that interval has probability X of "catching it in the act", so if N samples are taken, the expected number of samples catching it in the act is NX, with standard deviation sqrt(NX(1-X)). Example: if N = 20 and X = 20%, the mean is 4 and the standard deviation is about 1.8, so you can expect roughly 2 to 6 samples to show the problem. That gives you an imprecise measure of the problem, but it does tell you it is worth fixing, and it gives you a very precise location, without any further detective work.
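To make that arithmetic concrete (this is plain binomial statistics, nothing specific to any profiler):

```python
# Binomial arithmetic: with N random samples and a problem costing
# fraction X of the time, hits have mean N*X and sd sqrt(N*X*(1-X)).
from math import sqrt

N, X = 20, 0.20
mean = N * X                  # 4.0
sd = sqrt(N * X * (1 - X))    # about 1.79
print(f"expect roughly {mean - sd:.1f} to {mean + sd:.1f} hits")
# -> expect roughly 2.2 to 5.8 hits, i.e. the "2 to 6" above
```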
Problems typically manifest as more function, procedure, or method calls than necessary, especially as the software gets larger, with more layers of abstraction and thus deeper stacks. The first thing I look for is call sites (not functions, but call statements or instructions) that appear on multiple stack samples; the more samples they appear on, the more they cost. The second thing I ask is: could they be replaced? If they absolutely cannot be replaced with something faster, then they are simply necessary and I need to look elsewhere. But often they can be replaced, and I get a nice speedup. So I'm looking carefully at particular stack samples, not aggregating them into measurements.
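A sketch of that bookkeeping, with a sample format I've invented for illustration (each stack sample is just a list of "file:line" call-site strings):

```python
# Sketch: tally how many samples each call site appears on. The sample
# format (lists of "file:line" strings) is invented for illustration.
from collections import Counter

samples = [
    ["main.c:10", "parse.c:88", "alloc.c:5"],
    ["main.c:10", "parse.c:88", "parse.c:88", "io.c:40"],  # recursion
    ["main.c:10", "render.c:212"],
]

hits = Counter()
for sample in samples:
    for site in set(sample):  # count each sample once, even under recursion
        hits[site] += 1

for site, n in hits.most_common():
    print(f"{site}: on {n} of {len(samples)} samples")
```

Note the set(): a call site is counted once per sample no matter how many stack levels it occupies, which is why the next point holds.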
Recursion is not an issue, because the principle is the same: the cost of an instruction is the fraction of samples on which it appears on the stack (counted once per sample, even if it appears at several levels), whether or not it calls itself.
This is not something I do once, but in successive passes. Each problem I fix makes the program take less time, so the remaining problems become larger fractions of the time and easier to find. This effect compounds, so dramatic cumulative performance improvements are often possible.
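The arithmetic of the compounding is worth seeing once; the three problem costs below are invented for illustration:

```python
# Invented fractions, just to show the compounding. Each problem's cost
# is given as a fraction of the ORIGINAL run time; fixing one shrinks
# the total, so the next problem is a bigger share of what remains.
fractions = [0.30, 0.20, 0.10]

remaining = 1.0
for i, f in enumerate(fractions, 1):
    share = f / remaining          # share of current time when found
    remaining -= f
    print(f"pass {i}: {share:.0%} of current time; "
          f"cumulative speedup {1 / remaining:.2f}x")
# Final line: pass 3 gives a 2.50x overall speedup (1 / 0.40).
```

Notice that the 20% problem, a modest target at first, has grown to nearly 29% of the run time by the time the 30% problem is gone.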
I could go on, but I wish you luck, because I think there is a need for better profiling tools, and you have a good chance.