views: 102

answers: 4
Hi,

I’m using a log-based class in C++ to store very small floating-point values (values that would otherwise underflow the range of a double). As I’m performing a large number of multiplications, this has the added benefit of turning the multiplications into sums.

However, at a certain point in my algorithm, I need to divide a standard double value by an integer value and then do a *= to a log-based value. I have overloaded the *= operator for my log-based class; the right-hand-side value is first converted to a log-based value by running log() and then added to the left-hand-side value. Thus the operations actually performed are a floating-point division, a log(), and a floating-point addition.
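Schematically, the class looks something like this (the name `LogDouble` and the exact members are simplified for illustration, not my actual code):

```cpp
#include <cmath>

// A value stored in the log domain: rep_ holds log(value).
class LogDouble {
public:
    explicit LogDouble(double value) : rep_(std::log(value)) {}

    // Multiplying in the linear domain is adding in the log domain.
    LogDouble& operator*=(double rhs) {
        rep_ += std::log(rhs);   // convert via log(), then a floating-point add
        return *this;
    }

    double logValue() const { return rep_; }

private:
    double rep_;  // the stored logarithm
};
```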

My question is whether it would be faster to first convert the denominator to a log-based value, which would replace the floating-point division with a floating-point subtraction, yielding the following chain of operations: two log() calls, a floating-point subtraction, and a floating-point addition.
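Stripped of the class wrapper, the two candidate operation chains would look roughly like this (the function names are just for illustration):

```cpp
#include <cmath>

// acc holds the running result in the log domain (i.e. the log of the product).

// Variant 1: divide first, then take one log -- fdiv, log(), fadd.
void updateViaDivision(double& acc, double num, int den) {
    acc += std::log(num / den);
}

// Variant 2: take both logs and subtract -- log(), log(), fsub, fadd.
void updateViaLogSubtraction(double& acc, double num, int den) {
    acc += std::log(num) - std::log(static_cast<double>(den));
}
```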

In the end, this boils down to whether floating-point division is faster or slower than log(). I suspect the common answer will be that this is compiler- and architecture-dependent, so I’ll say that I use Apple’s gcc 4.2 on darwin 10.3.0. Still, I hope to get an answer with a general remark on the relative speed of these two operations and/or an idea of how to measure the difference myself, as there might be more going on here, e.g. executing the constructors that do the type conversion, etc.

Cheers!

A: 

The main problem with division is that, although it is a single instruction on most modern CPUs, it typically has a high latency (31 cycles on PowerPC; I'm not sure what it is on x86). Some of this latency can be hidden, though, if you have other non-dependent instructions that can be issued at the same time as the division. So the answer will depend somewhat on the instruction mix and the dependencies in the loop that contains your divide (not to mention which CPU you are using).

Having said that, my gut feeling is that divide will be faster than a log function on most architectures.

Paul R
+1  A: 

I'm pretty sure that doing a log computation via whatever algorithm is going to be rather more expensive than even FP division would be.

Of course the only way to be sure is to code it up and measure the performance of the code. From your description it sounds like it shouldn't be too difficult to implement both versions and try it side-by-side.

Mark B
+3  A: 

On the x86 architecture, logarithms take significantly longer than divisions: 85 cycles (throughput) for FYL2X compared to 40 cycles for FDIV. I would be surprised if other architectures are much different. Go with the floating-point division.

Martin B
gcc/darwin (which he's using) uses SSE2 for double-precision arithmetic, not the x87 unit. `log` is implemented in software, and divide uses the `divsd` instruction.
Stephen Canon
@Stephen: Thanks for that additional information -- that would tip the balance even further in favour of div.
Martin B
@Martin B: Actually, it makes the comparison more equal; `divsd` takes 24 cycles on i7, whereas the darwin `log` implementation requires 35 cycles (both timings are *throughput*). On older hardware, they're more equal still, as earlier micro-architectures have slower dividers but the other SSE operations run at approximately the same speed. Divide still comes out ahead, though.
Stephen Canon
@Stephen: By now you _definitely_ deserve the rep for this answer more than I do... ;) I would never have suspected that a software implementation of `log` could get by in 35 cycles... can you shed some light on how it's done? I suspect a lot of shifting and bit-twiddling is involved? Should I open a new question on this ("how is `log` implemented on Darwin")?
Martin B
@Martin B: It's not *that* shockingly fast; 35 cycles is an eternity on a modern processor (consider: an i7 core is capable of issuing 165 instructions in 35 cycles). What's shocking is how slow divide still is =)
Stephen Canon
The darwin libm is open source, distributed under the APSL. Unfortunately, `log` is implemented in a fairly hard-to-read fashion (a single source file supports `log`, `log2`, `log10`, and `log1p`, so you need to unwind a bunch of preprocessor symbols to make sense of it). I would say the most essential speed "tricks" that are used are (1) it detects special cases without needing any floating-point comparisons and (2) it uses vector arithmetic to reduce the latency of the core polynomial evaluation. http://opensource.apple.com/source/Libm/Libm-315/Source/Intel/log_universal.h
Stephen Canon
@Stephen: Thanks!
Martin B
+1  A: 

Do you divide by the same integer multiple times? If so, you can instead multiply by 1./yourInteger and only do the divide once. That would be faster than either option, if it's possible in your case.
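Something along these lines (the helper and variable names are just illustrative):

```cpp
#include <cstddef>
#include <vector>

// Scale every element by 1/yourInteger using a single division and
// N multiplications, instead of N divisions.
void scaleByReciprocal(std::vector<double>& values, int yourInteger) {
    const double inv = 1.0 / static_cast<double>(yourInteger);
    for (std::size_t i = 0; i < values.size(); ++i) {
        values[i] *= inv;   // a multiply is much cheaper than a divide
    }
}
```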

As to your actual question, it's not only compiler and architecture dependent, but also micro-architecture and data dependent.

On your particular platform (darwin/x86), for current (i5/i7) hardware: ~24 cycles for divide(1), ~35 cycles for log( )(2). However, because divide only uses a single instruction dispatch slot, the hardware's reorder engine can do other useful computation while the divide is in flight; log( ), by contrast, is implemented in software, so there is less opportunity for the processor to hoist other computations into the latency of the logarithm. This means that in practice, divide will often be a good bit faster.

1) From the Intel Optimization Manual

2) Measured by calling log( ) in a tight loop and using mach_absolute_time( ) to get wall time.
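For reference, a rough sketch of that kind of measurement (not my exact test harness; the loop body and iteration count are illustrative, but `mach_absolute_time( )` and `mach_timebase_info( )` are the real darwin APIs):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <mach/mach_time.h>

int main() {
    const int iterations = 10000000;
    double sink = 0.0;      // accumulate results so the calls aren't optimized away
    double x = 1.2345;

    mach_timebase_info_data_t info;
    mach_timebase_info(&info);          // conversion factor from ticks to nanoseconds

    uint64_t start = mach_absolute_time();
    for (int i = 0; i < iterations; ++i) {
        sink += std::log(x);            // swap in 1.0 / x to time the division instead
        x += 1e-9;                      // vary the argument slightly each iteration
    }
    uint64_t end = mach_absolute_time();

    double ns = (double)(end - start) * info.numer / info.denom;
    std::printf("%.2f ns per call (sink = %f)\n", ns / iterations, sink);
    return 0;
}
```

Divide the per-call time by your clock period to get an approximate cycle count.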

Stephen Canon
A very good point on multiplying by `1./myInteger`. Wouldn’t have thought of that, thanks! I’m currently running on an ageing 2006 Core 2 Duo, but your calculations are a nice insight and exactly what I was looking for.
Ventzi Zhechev
@Ventzi: I dug out my "aging core 2 duo" to do some timing: 35 cycles for divide, 47 cycles for log.
Stephen Canon