If you're running on an AMD processor, CodeAnalyst is free and can do that (at least in time-based profiling): you can "zoom" in and out to see what is taking the most CPU time, from processes to functions down to single assembly instructions.
However, keep in mind that to get meaningful results at that resolution with time-based profiling, you should run the critical part of the code several times; otherwise the profiler collects too few samples and the statistics you get don't mean much.
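As a rough illustration of running the critical part repeatedly, here is a minimal C++ sketch. `critical_section` and `average_ns_per_call` are hypothetical names invented for this example; replace the body of `critical_section` with your own hot code and pick an iteration count large enough that the profiled run lasts at least a few seconds.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the critical code you want to profile.
std::uint64_t critical_section(const std::vector<std::uint32_t>& data) {
    std::uint64_t sum = 0;
    for (std::uint32_t v : data) {
        sum += static_cast<std::uint64_t>(v) * v;
    }
    return sum;
}

// Calls critical_section `iterations` times and returns the average
// wall-clock time per call in nanoseconds. The accumulated result is
// stored in *checksum so that the computed values are actually used.
double average_ns_per_call(const std::vector<std::uint32_t>& data,
                           std::size_t iterations,
                           std::uint64_t* checksum) {
    assert(iterations > 0);
    std::uint64_t acc = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        acc += critical_section(data);
    }
    const auto stop = std::chrono::steady_clock::now();
    *checksum = acc;
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling `average_ns_per_call(data, 100000, &checksum)` from `main` while the profiler is attached gives it many samples inside the hot loop. Note that an optimizer may still hoist or simplify a loop this trivial, so with real code it is worth checking the disassembly view to confirm the work is really being repeated.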
By the way, in my opinion you should forget about the "fewer function calls => faster" idea. If the cost of a function call is bigger than its "payload", the compiler should be able to figure out by itself whether it's worth inlining the call, and in some cases inlining too much can even slow down the code.
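To make the inlining point concrete, here is a small C++ sketch with two hypothetical functions that compute the same thing, one through a tiny helper and one with the helper's body written out by hand. With optimizations enabled (for example `-O2` on GCC or Clang), compilers will typically inline a helper this small, so the two versions usually end up as very similar machine code; that is an expectation to verify in the profiler's disassembly view, not a guarantee.

```cpp
#include <vector>

// Tiny helper: its "payload" is a single multiplication, so an actual
// call would cost more than the work it does.
inline double squared(double x) { return x * x; }

// Version written with a function call per element.
double sum_of_squares_with_call(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) {
        total += squared(v);
    }
    return total;
}

// Hand-inlined version: same arithmetic, no helper call in the source.
double sum_of_squares_manual(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) {
        total += v * v;
    }
    return total;
}
```

Both functions return the same result for the same input, so the readable version with the helper is usually the better one to keep unless profiling shows otherwise.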