This is what I use. Although it is not suitable for live production use, it answers your other needs.
For live production use, you need something that samples the stack. In my opinion, it's OK if it has some small overhead. My goal is to discover the activities that need optimization, and for that I'm willing to pay a temporary price in speed.
There are always one or more intervals of interest, like the interval between when a request is received and when the response goes out. It's surprising how few samples you need in such an interval to find out what's taking the time.
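To make that concrete, here is a minimal sketch in Python of sampling the stack during one interval of interest. It is not the tool I mentioned above, just an illustration. It relies on `sys._current_frames()`, which is CPython-specific, and all the function names (`profile_interval`, `handle_request`, `slow_part`, `fast_part`) are made up for the example.

```python
import sys
import threading
import time
import traceback


def sample_stacks(target_thread_id, stop_event, samples, period_s):
    """Record the target thread's call stack every period_s seconds until stopped."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            # Keep only the function names, outermost call first.
            samples.append(tuple(f.name for f in traceback.extract_stack(frame)))
        time.sleep(period_s)


def profile_interval(work, period_s=0.01):
    """Run work() while a background thread samples the calling thread's stack."""
    samples = []
    stop_event = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), stop_event, samples, period_s),
    )
    sampler.start()
    try:
        work()
    finally:
        stop_event.set()
        sampler.join()
    return samples


def fraction_containing(samples, function_name):
    """Fraction of samples in which function_name appears anywhere on the stack."""
    if not samples:
        return 0.0
    return sum(1 for stack in samples if function_name in stack) / len(samples)


# Hypothetical request handler: most of its wall-clock time is in slow_part.
def slow_part():
    time.sleep(0.2)


def fast_part():
    time.sleep(0.05)


def handle_request():
    slow_part()
    fast_part()


if __name__ == "__main__":
    stacks = profile_interval(handle_request)
    for name in ("slow_part", "fast_part"):
        print(name, round(fraction_containing(stacks, name), 2))
```

The sampling period is coarse on purpose. Even a couple of dozen samples should show `slow_part` on most of them, which is all you need to know where to look.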
High precision of timing is not needed. If there is some activity X going on that, through optimization, would save you, say, 50% of the interval, then roughly 50% of the samples will show X on the stack.
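One way to put a number on "how few samples" is a simple binomial calculation. This is a sketch under two assumptions of mine: the samples are independent, and each one lands in X with probability equal to the fraction of the interval that X takes. The helper name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k, n, f):
    """Probability that at least k of n independent samples show X,
    assuming X accounts for fraction f of the interval."""
    # 1 minus the probability of seeing X on fewer than k samples.
    return 1.0 - sum(comb(n, i) * f**i * (1 - f) ** (n - i) for i in range(k))


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, prob_at_least(2, n, 0.5))
```

Under those assumptions, with X taking 50% of the interval, 10 samples show X at least twice with probability of about 0.989. That is why a handful of samples is usually enough to spot anything big.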