Views: 318 · Answers: 4
+1  A: 

Yeah, it's slow. As to exactly why, someone who feels more confident can try to explain in detail.

Want to speed it up? Here: http://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/

Milan
Just keep in mind that's an approximation and doesn't actually give you the power.
Billy ONeal
Heh, neat! I don't need a lot of accuracy here, so this may be an option...
Eamon Nerbonne
+2  A: 

If your code involves some heavy number-crunching, I wouldn't be too surprised that std::pow is consuming 5% of the running time. Many numeric operations are very fast, so a slightly slower operation like std::pow will appear to take more time relative to the other already-fast operations. (That would also account for why you didn't see much improvement switching to std::powf.)

The cache misses are somewhat more puzzling, and it's hard to offer an explanation without more data. One possibility is that if your other code is so memory-intensive that it gobbles up all the available cache, then it wouldn't be completely surprising that std::pow is taking all the punches on the cache misses.

John Feminella
@John Feminella: The OP says he has but one call to pow() in his program.
Billy ONeal
@Billy: But he didn't say whether it was in a loop.
Bill
Also - doesn't std::pow act on valarrays? If so, how big are they? And - doesn't VC++ have a habit of doing double arithmetic instead of float? If so, might it be converting valarrays before processing them?
Steve314
@Bill: Good point. (Plus I need to upvote a comment from another "Bill") :)
Billy ONeal
pow's slow, but it shouldn't be that slow - and in any case, why would pow be causing cache misses? If the rest of my code were memory-intensive, I'd expect each cache miss caused by pow to be matched by a cache miss later on when it needs to reload that other data - but no, pow causes well over two-thirds of the cache misses (the precise amount varies depending on which variant of pow I use, or whether I use exp/log).
Eamon Nerbonne
@Eamon: It's hard to make predictions about the root cause without code. Are you able to post some?
John Feminella
I tried isolating a test-case: it's not easy. There's probably some interaction going on somewhere - which means it's probably quite difficult for you to divine, not having all the data. In essence, it's a bunch of calls to Eigen (the KDE library)-based matrix code that computes the stochastic gradient descent of a cost function I'm interested in. Thanks for the help anyhow - at least I now realize it's unlikely to be `pow`-specific.
Eamon Nerbonne
+1  A: 

If you replace the std::pow(var, var) call with another function, like std::max(var, var), does it still take up 5%? Do you still get all the cache misses?

I'm guessing no on time and yes on cache misses. Calculating powers is slower than many other operations (which are you using?). Calling out to code that's not in the cache will cause a cache miss no matter which function it is.

Bill
Cache misses are slow - so if I remove the cache misses, I expect the performance to improve as well. For instance, in this same code, the previous huge source of cache misses was the first access of the input. I now _mm_prefetch the next iteration's input before starting this iteration - and no more cache misses!
Eamon Nerbonne
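The prefetch trick Eamon describes looks roughly like this (a hypothetical sketch for x86 - `process_rows` and the row layout are invented for illustration, not his actual code):

```cpp
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0 (x86 only)

// Hypothetical loop: issue a prefetch for the NEXT iteration's input
// while working on the current one, so the load overlaps the compute.
void process_rows(const float* rows, int nrows, int rowlen, float* out) {
    for (int i = 0; i < nrows; ++i) {
        if (i + 1 < nrows)
            _mm_prefetch(reinterpret_cast<const char*>(rows + (i + 1) * rowlen),
                         _MM_HINT_T0);
        float s = 0.0f;
        for (int j = 0; j < rowlen; ++j)   // stand-in for the real per-row work
            s += rows[i * rowlen + j];
        out[i] = s;
    }
}
```

Note that `_mm_prefetch` fetches a single cache line; for rows longer than 64 bytes you would prefetch each line of the next row, not just its first element.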
Thanks for the idea of trying other functions - I'll experiment!
Eamon Nerbonne
`std::max` is probably a bad example, as it'll almost certainly get inlined so the cache miss will disappear as well.
jalf
@jalf: good point, I forgot about that. Do you have any suggestions for a function that's less likely to be inlined but will still be measurably faster than `pow`?
Bill
@Eamon: As jalf pointed out, inlining could mess this up. (I assume you're compiling with optimizations including inlining.)
Bill
I tried `sqrt` - using that, overall program execution time is down about 6%, i.e. the abnormal time usage is simply gone.
Eamon Nerbonne
@Eamon: In that case, you may want to go with Zka's suggestion of an approximation algorithm if accuracy is less important than execution time.
Bill
It's a machine learning algorithm and the only use for `pow` is to compute the learning rate - not exactly a value that needs to be super-precise.
Eamon Nerbonne
This got me thinking in the right direction - marked as answer.
Eamon Nerbonne
+1  A: 

Can you give more information on the 'x' as well as the environment where pow is evaluated?

What you are seeing might be the hardware prefetchers at work. Depending on the profiler, the attribution of the 'cost' to individual assembly instructions might be incorrect; this misattribution is even more frequent on long-latency instructions like the ones needed to evaluate pow.

Added to that, I would use a real profiler like VTune/PTU rather than the one available in any Visual Studio version.

Fabien Hure
Thanks for the idea! I looked into it, and... most values of x are near 1.0 (perhaps that's extra slow?). I also cross-compile the code with MinGW, and there, it turns out it is! By changing the code to exp/log style, the overall program is almost **twice** as fast (!!), so apparently x near 1.0 really messes up pow in gcc on x64 (http://sourceware.org/ml/libc-help/2009-01/msg00003.html). Perhaps the effect is there for pow on VS.NET too - just differently?
Eamon Nerbonne
Incidentally, what's wrong with the Visual Studio profiler? Looks quite reasonable to me; I'm using the 2010 RC version: http://blogs.msdn.com/profiler/ (I know it's an RC, but "free" is still a lot better price than "way overpriced" - apart from the fact that I don't feel like spending a day mulling over all the different VTUNE licensing options.)
Eamon Nerbonne