Opinion? Yeah. While you're deciding whether to buy a profiler, and which one, give this a try.
ADDED: @Max: Step-by-step instructions: The IDE has a "pause" button. Run your app under the IDE, and while it is being subjectively slow, i.e. while you are waiting for it, hit the "pause" button. Then get a snapshot of the call stack.
To snapshot the call stack, what I do is display it (that's one of the debug windows). In the IDE options you can find options for what to display in the stack view. I turn off the option to display the function arguments, because that makes the lines too long. I'm interested in the line number where the call takes place and the name of the function being called. Then, in the call stack view, you can do a "Select All" and then a "Copy", and then paste it into Notepad. It's a little clumsy, I know, but I used to write them down by hand.
I take a few samples this way. Then I look for lines that appear on more than one sample, because those are the time-takers. Some are simply necessary, like "call main", but some are not. Those are the gold nuggets. If I don't find any, I keep taking samples, up to about 20. If I still don't find any (very unusual), then by that time the program is pretty well optimized. (A key point is that every time you do this the program gets faster, and in the process the remaining performance problems get relatively bigger and easier to find. That is, not only does the program get faster by a certain ratio R, but the remaining problems get bigger, percentage-wise, by that same ratio.)*
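If you have pasted several samples into text, the "appears on more than one sample" check can be done mechanically. Here is a minimal sketch, not a real tool: the `rank_lines` function and the sample frames (file names, line numbers, function names) are made up for illustration, and it assumes each sample is just a list of call-stack lines as copied from the stack view.

```python
from collections import Counter


def rank_lines(samples):
    """Rank call-stack lines by the fraction of samples that contain them.

    samples: list of stack snapshots, each a list of strings (one per frame).
    Returns (line, fraction) pairs, highest fraction first, keeping only
    lines seen on more than one sample.
    """
    counts = Counter()
    for frames in samples:
        # Count a line once per sample, even if recursion repeats it.
        counts.update(set(frames))
    n = len(samples)
    ranked = [(line, c / n) for line, c in counts.items() if c > 1]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical samples, as if pasted from the call stack view.
samples = [
    ["main.cpp:12 main", "app.cpp:40 LoadAll", "db.cpp:88 Query"],
    ["main.cpp:12 main", "app.cpp:40 LoadAll", "str.cpp:15 Format"],
    ["main.cpp:12 main", "ui.cpp:7 Repaint"],
]

ranked = rank_lines(samples)
for line, fraction in ranked:
    print(f"{fraction:5.0%}  {line}")
```

The output only tells you where to look. Deciding whether a line is necessary (like the call to `main`) or a gold nugget still means reading the individual samples.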
Another thing I do in this process is ask myself what the program is doing in that sample, and why. The "why" is very important, because that is how you tell if a line is actually necessary or could be replaced with something less costly. If I'm not sure why it is there, I single-step it a little, maybe look at the data, or maybe let it return up a few levels (Shift-F11) until I understand what it's doing. That's about all there is to it.
Existing profilers could help with this process if they actually performed stack sampling, retained the samples, ranked lines by the percentage of samples containing them, and then let you study individual samples in detail. Maybe they will at some point, but they don't now. They are hung up on issues like efficiency and measurement.
*Suppose your code spends 90% of its time doing X, and 9% of its time doing Y, neither of which is really necessary. Take a small number of samples, and you will see X, but probably not Y. If you fix X you get a 10x speedup. Do the sampling again (you may have to wrap an outer loop around the program so it runs long enough to take samples). Now you see Y almost for sure, because it now takes 9% x 10 = 90% of the time. Fixing it gives you another 10x, for an overall 100x speedup.
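The arithmetic in that footnote can be checked with a few lines of Python. This is only a sketch under simplifying assumptions: fixing an activity removes its time entirely, samples land at independent random times, and the 1% labeled "rest" is my own filler so the fractions add up to 100%. The function names are invented for illustration.

```python
def fix(fractions, name):
    """Remove one activity; return the renormalized fractions and the speedup."""
    before = sum(fractions.values())
    remaining = {k: v for k, v in fractions.items() if k != name}
    after = sum(remaining.values())
    return {k: v / after for k, v in remaining.items()}, before / after


def p_seen_twice(fraction, n):
    """Chance that an activity taking `fraction` of the time shows up on at
    least two of n independent samples (binomial model)."""
    miss_all = (1 - fraction) ** n
    hit_once = n * fraction * (1 - fraction) ** (n - 1)
    return 1 - miss_all - hit_once


start = {"X": 0.90, "Y": 0.09, "rest": 0.01}  # "rest" is an assumed filler

after_x, speedup_x = fix(start, "X")
after_y, speedup_y = fix(after_x, "Y")

print(f"X on 2+ of 5 samples: {p_seen_twice(start['X'], 5):.3f}")
print(f"Y on 2+ of 5 samples: {p_seen_twice(start['Y'], 5):.3f}")
print(f"Y's share after fixing X: {after_x['Y']:.0%}")
print(f"overall speedup: {speedup_x * speedup_y:.0f}x")
```

With five samples, X is nearly certain to show up twice while Y is unlikely to. Once X is gone, Y takes X's place at about 90%.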