Views: 318 · Answers: 4
+1  A: 

Yeah, it's slow. As to exactly why, someone who feels more confident can try to explain in detail.

Want to speed it up? Here: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/

Milan
Just keep in mind that's an approximation and doesn't actually give you the power.
Billy ONeal
Heh, neat! I don't need a lot of accuracy here, so this may be an option...
Eamon Nerbonne
+2  A: 

If your code involves some heavy number-crunching, I wouldn't be too surprised that std::pow is consuming 5% of the running time. Many numeric operations are very fast, so a slightly slower operation like std::pow will appear to take more time relative to the other already-fast operations. (That would also account for why you didn't see much improvement switching to std::powf.)

The cache misses are somewhat more puzzling, and it's hard to offer an explanation without more data. One possibility is that if your other code is so memory-intensive that it gobbles up all the available cache, then it wouldn't be completely surprising that std::pow is taking all the punches on the cache misses.

John Feminella
@John Feminella: The OP says he has but one call to pow() in his program.
Billy ONeal
@Billy: But he didn't say whether it was in a loop.
Bill
Also - doesn't std::pow act on valarrays? If so, how big are they? And - doesn't VC++ have a habit of doing double arithmetic instead of float? If so, might it be converting valarrays before processing them?
Steve314
@Bill: Good point. (Plus I need to upvote a comment from another "Bill") :)
Billy ONeal
pow's slow, but it shouldn't be that slow - and in any case, why would pow be causing cache misses? If the rest of my code were memory-intensive, I'd expect each cache miss caused by pow to be matched by a cache miss later on when it needs to reload that other data - but no, pow causes well over two-thirds of the cache misses (the precise amount varies depending on which variant of pow I use, or whether I use exp/log).
Eamon Nerbonne
@Eamon: It's hard to make predictions about the root cause without code. Are you able to post some?
John Feminella
I tried isolating a test-case: it's not easy. There's probably some interaction going on somewhere - which means it's probably quite difficult for you to divine, not having all the data. In essence, it's a bunch of calls to Eigen (the KDE library)-based matrix code that computes the stochastic gradient descent of a cost function I'm interested in. Thanks for the help anyhow - at least I now realize it's unlikely to be `pow`-specific.
Eamon Nerbonne
+1  A: 

If you replace the std::pow(var, var) call with another function, like std::max(var, var), does it still take up 5%? Do you still get all the cache misses?

I'm guessing no on time and yes on cache misses. Calculating powers is slower than many other operations (which are you using?). Calling out to code that's not in the cache will cause a cache miss no matter which function it is.

Bill
Cache misses are slow - so if I remove the cache misses, I expect the performance to improve as well. For instance, in this same code, the previous huge source of cache misses was the first access of the input. I now _mm_prefetch the next iteration's input before starting this iteration - and no more cache misses!
Eamon Nerbonne
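The prefetch trick Eamon describes looks roughly like this (a hypothetical sketch for x86 - `process_rows` and the row layout are invented for illustration, not his actual code):

```cpp
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0 (x86 only)

// Hypothetical loop: issue a prefetch for the NEXT iteration's input
// while working on the current one, so the load overlaps the compute.
void process_rows(const float* rows, int nrows, int rowlen, float* out) {
    for (int i = 0; i < nrows; ++i) {
        if (i + 1 < nrows)
            _mm_prefetch(reinterpret_cast<const char*>(rows + (i + 1) * rowlen),
                         _MM_HINT_T0);
        float s = 0.0f;
        for (int j = 0; j < rowlen; ++j)   // stand-in for the real per-row work
            s += rows[i * rowlen + j];
        out[i] = s;
    }
}
```

Note that `_mm_prefetch` fetches a single cache line; for rows longer than 64 bytes you would prefetch each line of the next row, not just its first element.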
Thanks for the idea of trying other functions - I'll experiment!
Eamon Nerbonne
`std::max` is probably a bad example, as it'll almost certainly get inlined so the cache miss will disappear as well.
jalf
@jalf: good point, I forgot about that. Do you have any suggestions for a function that's less likely to be inlined but will still be measurably faster than `pow`?
Bill
@Eamon: As jalf pointed out, inlining could mess this up. (I assume you're compiling with optimizations including inlining.)
Bill
I tried `sqrt` - using that, overall program execution time is down about 6%, i.e. the abnormal time usage is simply gone.
Eamon Nerbonne
@Eamon: In that case, you may want to go with Zka's suggestion of an approximation algorithm if accuracy is less important than execution time.
Bill
It's a machine learning algorithm and the only use for `pow` is to compute the learning rate - not exactly a value that needs to be super-precise.
Eamon Nerbonne
This got me thinking in the right direction - marked as answer.
Eamon Nerbonne
+1  A: 

Can you give more information on the 'x' as well as the environment where pow is evaluated?

What you are seeing might be the hardware prefetchers at work. Depending on the profiler, the attribution of the 'cost' to individual assembly instructions might be incorrect; this misattribution is even more frequent on long-latency instructions like the ones needed to evaluate pow.

Added to that, I would use a real profiler like VTune/PTU rather than the one available in any Visual Studio version.

Fabien Hure
Thanks for the idea! I looked into it, and... most values of x are near 1.0 (perhaps that's extra slow?). I also cross-compile the code with MinGW, and there, it turns out it is! By changing the code to exp/log style, the overall program is almost **twice** as fast (!!), so apparently x near 1.0 really messes up pow in gcc on x64 (http://sourceware.org/ml/libc-help/2009-01/msg00003.html). Perhaps the effect is there for pow on VS.NET too - just differently?
Eamon Nerbonne
Incidentally, what's wrong with the Visual Studio profiler? Looks quite reasonable to me; I'm using the 2010 RC version: http://blogs.msdn.com/profiler/ (I know it's an RC, but "free" is still a lot better price than "way overpriced" - apart from the fact that I don't feel like spending a day mulling over all the different VTUNE licensing options.)
Eamon Nerbonne