same answer as:
Re-edited: You asked what your options were. If your heart is set on profiling, then look for a profiler.
On the other hand, if you actually have a performance problem to find, the simple method works as well or better than nearly every profiler. I say nearly every, because in some profilers you can actually tease out what you need to know, which is the time-cost attributable to individual instructions, especially call instructions.
The time-cost of an instruction is the amount of time that would be saved if the instruction could be removed, and a good estimate of it is the fraction of call stack samples containing it. You don't need to estimate that fraction with high precision. If the instruction is on 5 out of 10 samples, it's cost is probably somewhere in the range 45% to 55%. No matter - if you could get rid of it, you would save its cost.
So finding performance problems is not hard. Just take a number of call stack samples, collect the set of instructions on those samples, and rank the instructions by the fraction of samples containing them. Among the high-fraction instructions are some that you could optimize away, and you don't have to guess where they are.
I'm simplifying somewhat, because often it is helpful to examine more state information than just the call stack, to see if some work being done is really necessary. But I hope the point is made.
People express doubt that it could work in the presence of recursion, or work on large programs. A little thought (and experimentation) shows those objections do not hold water.