I sometimes see, and have used, the following variation for a fast divide in C++ with floating-point numbers.

// orig loop
double y = 44100.0;
for (int i = 0; i < 10000; ++i) {
    double z = x / y;        // x is some varying input value (pseudo-code)
}

// alternative: hoist the reciprocal out of the loop
double y = 44100.0;
double y_div = 1.0 / y;

for (int i = 0; i < 10000; ++i) {
    double z = x * y_div;
}

But someone hinted recently that this might not be the most accurate way.

Any thoughts?

+1  A: 

Multiplication is faster than division, so the second method is faster. It might be slightly less accurate, but unless you are doing hard-core numerics that level of accuracy should be more than enough.
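For a feel of how small that loss usually is, a minimal check along these lines (the sample value 0.7071 is arbitrary) typically shows a difference of zero or one unit in the last place:

#include <cstdio>
#include <cmath>

int main() {
    double x = 0.7071;          // arbitrary sample value
    double y = 44100.0;
    double y_div = 1.0 / y;

    double a = x / y;           // one correctly rounded operation
    double b = x * y_div;       // two roundings: 1/y, then the multiply
    printf("x / y      = %.20g\n", a);
    printf("x * y_div  = %.20g\n", b);
    printf("difference = %g\n", std::fabs(a - b));
    return 0;
}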

second
Thanks, I'm doing high quality audio processing where accuracy is important.
Stephen Blinkhorn
+5  A: 

Wikipedia agrees that this can be faster. The linked article also contains several other fast division algorithms that might be of interest.

I would guess that any industrial-strength modern compiler will make that optimization for you if it is going to profit you at all.

mwigdahl
Industrial-strength compilers will not make optimizations like that, if they may change program output; after all, an optimization that gives the wrong answer is a poor optimization. Because we have no language-level mechanism for specifying the precision we *care* about, the compiler may only assume we care about all of it.
Tom
+2  A: 

Your original

// original loop:
double y = 44100.0;

for(int i=0; i<10000; ++i) {
    double z = x / y;
}

can easily be optimized to

// haha:
double y = 44100.0;
double z = x / y;

and the performance is pretty nice. ;-)

EDIT: People keep voting this down, so here's the not so funny version:

If there were a general way to make division faster for all cases, don't you think compiler writers might have happened upon it by now? Of course they would have. The people designing FPU circuits aren't exactly stupid, either.

So the only way you're going to get better performance is to know exactly what situation you have at hand and write optimal code for that. Most likely this is a complete waste of your time, because your program is slow for some other reason, such as recomputing loop-invariant math inside the loop. Go find a better algorithm instead.

dwc
Wow, that's an amazing optimization. So, obviously, I was writing pseudo-code and x would vary over time.
Stephen Blinkhorn
That's not right, 'z' can't be seen outside the loop scope... :-P
TomWij
humor is good and all, but I'm sure you'll get some great answers here in addition to my joke answer.
dwc
You're right. Using the -O3 flag in GCC results in zero performance gain. Without the flag, the optimization is about 5-6 times faster. I've seen exceptional programmers avoiding divides in loops, but you are right again - the only way to benchmark is to do specific tests on specific data. But if you read the *question* I actually posted, you'll find that I asked about accuracy, not performance :)
Stephen Blinkhorn
Good point you have there. :)
dwc
Compilers don't optimize division into a multiplication, because there can be a small difference in the result. Hence this is one of the cases where the programmer has to decide if he/she wants to optimize or not.
Accipitridae
A: 

Are you looping 10,000 times simply to make the code take long enough to measure easily, or do you have 10,000 numbers to divide by the same number? If the former, put the "y_div = 1.0 / y;" inside the loop, because it's part of the operation.

If the latter, yes, floating-point multiplication is generally faster than division. Don't change your code from the obvious to the arcane based on guesses, though. Benchmark first to find slow spots, and then optimize those (and take measurements before and after to make sure your idea actually causes an improvement).

KeyserSoze
The whole point of doing it outside the loop is that you only need to do it once - multiplication by the reciprocal is (floating point errors aside) equivalent to division.
Andy Mikula
+13  A: 

On just about every CPU, a floating-point divide is several times as expensive as a floating-point multiply, so multiplying by the inverse of your divisor is a good optimization. The downside is that you may lose a very small amount of accuracy on certain processors: e.g., on modern x86 processors in the default FPU mode, 64-bit float operations are actually computed internally using 80 bits, and storing the result off in a variable causes those extra precision bits to be truncated according to your FPU rounding mode (which defaults to round-to-nearest). This only really matters if you are concatenating many float operations and have to worry about the error accumulation.
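A rough way to judge whether that accumulation matters for a given workload (just a sketch; the sine input stands in for real samples) is to run both forms over a long stretch of data and compare the accumulated results:

#include <cstdio>
#include <cmath>

int main() {
    const double y = 44100.0;
    const double y_div = 1.0 / y;
    double sum_div = 0.0, sum_mul = 0.0;

    for (int i = 0; i < 1000000; ++i) {
        double x = std::sin(i * 0.001);   // stand-in for an audio signal
        sum_div += x / y;                 // divide every sample
        sum_mul += x * y_div;             // multiply by the precomputed reciprocal
    }
    printf("sum with divide:   %.17g\n", sum_div);
    printf("sum with multiply: %.17g\n", sum_mul);
    printf("difference:        %g\n", std::fabs(sum_div - sum_mul));
    return 0;
}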

Not Sure
"storing it off in a variable will cause" - is it guaranteed to cause that, or are compilers allowed to carry double values as represented in the FPU where possible? Also you can probably use long double if rounding worries you.
Steve Jessop
I'm not 100% certain, but it depends on the exact code. The x87 FPU register stack holds the full 80 bits per value, so if the compiler can keep the intermediate inverse-divisor value on the FPU register stack, then no precision should be lost. However, the compiler may not be able to keep that value in an FPU register stack and will have to store it on the program stack, at which point the precision will be truncated to 64 bits. If that is a concern though, the assembly really should be hand coded, instead of being left up to the compiler.
Not Sure
FWIW: m68k-series processors had opcodes for storing, retrieving and manipulating 80-bit floats. Cost some speed.
dmckee
Values are computed internally in 82 bits and stored on the FPU stack in 80-bit format. But x87 is no longer the default mode for floating-point calculations; SSE2 is on modern CPUs.
osgx
A: 

On old CPUs like the 80286, floating point maths was abysmally slow and we employed lots of trickiness to speed things up.

On modern CPUs floating point maths is blindingly fast and optimising compilers can generally do wonders with fine-tuning things.

It is almost never worth the effort to employ little micro-optimisations like that.

Try to make your code simple and idiot-proof. Only if you find a real bottleneck (using a profiler) should you think about optimising your floating point calculations.

Michael J
On modern CPUs like the Core2 Duo, floating-point math still takes non-zero time, and hence is still worth optimizing for CPU-bound problems.
Tom
The question was about precision loss. The OP is obviously working on audio software (44100 Hz), and when tricks like this make a digital signal processing application, like the one I used to work on, 2 to 10 times faster, then these so-called little micro-optimisations are not only worth the effort, they are critical.
Jem
Unless you've profiled the code you don't know if you have a bottleneck. If you are doing millions of computations, you need to consider how you access your data; by optimising that you can often avoid page faults that are probably much slower than the computations. Even better is if you can improve your algorithm to reduce the number of computations. You can probably find cases where changing divides to multiplies improves performance, but they will be pretty rare. Your number one concern is to write maintainable code - and little tricks are not your friend when you're doing that.
Michael J
No, I strongly disagree. Everyone in charge of heavy numeric computation software involving floating point must know and use this way to divide. It does not degrade the code's maintainability. Your number one concern here is the number of voices you can handle simultaneously, or the number of seconds your customers have to wait. And even if you had bottlenecks and page faults (which were probably eliminated long ago), why waste CPU cycles that could be used by other threads or processes, when you can make your processing units significantly faster for free?
Jem
+1  A: 

Audio, hunh? It's not just 44,100 divisions per second when you have, say, five tracks of audio running at once. Even a simple fader consumes cycles, after all. And that's just for a fairly bare-bones, minimal example -- what if you want to have, say, an eq and a compressor? Maybe a little reverb? Your total math budget, so to speak, gets eaten up quickly. It does make sense to wring out a little extra performance in those cases.

Profilers are good. Profilers are your friend. Profilers deserve blowjobs and pudding. But you already know where the main bottleneck is in audio work: it's in the loop that processes samples, and the faster you can make that, the happier your users will be. Use everything you can! Multiply by reciprocals, shift bits whenever possible, use lookup tables, pass variables by reference instead of by value (less pushing/popping on the stack), refactor terms (exp(x+y) = exp(x)*exp(y), after all), and so forth. (If you're good, you'll laugh at these elementary optimizations.)
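As a concrete illustration of the "multiply by reciprocals" item, a per-block gain routine can hoist the single divide out of the sample loop (a sketch; the function name and signature are made up):

// hypothetical per-block fader: scale every sample by 1/divisor
void apply_inverse_gain(float* samples, int count, float divisor) {
    const float inv = 1.0f / divisor;   // one divide per block...
    for (int i = 0; i < count; ++i)
        samples[i] *= inv;              // ...instead of one divide per sample
}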

MIke V
A: 

I presume from the original post that x is not actually a constant: it's probably data from an array, so x[i] is likely the source of each value, and similarly the output will be stored somewhere in memory.

I suggest that if the loop count really is 10,000, as in the original post, it will make little difference which you use, since the whole loop won't even take a fraction of a millisecond on a modern CPU. If the loop count is very much higher, perhaps 1,000,000 or more, then I would expect the cost of memory access to make the faster operation irrelevant, as the CPU will always be waiting for data.

I suggest trying both with your code and testing whether it actually makes any significant difference in run time. If it doesn't, just write the straightforward division if that's what the algorithm needs.
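If you do measure it, something along these lines keeps the memory-access pattern identical for both versions (a sketch; the buffer size and contents are arbitrary, and the checksum only exists so the compiler cannot discard the loops):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> in(1000000, 0.5), out(1000000);
    const double y = 44100.0, y_div = 1.0 / y;
    double checksum = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] / y;
    for (double v : out) checksum += v;
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = in[i] * y_div;
    for (double v : out) checksum += v;
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    printf("divide:   %lld us\n", (long long)us(t0, t1));
    printf("multiply: %lld us\n", (long long)us(t1, t2));
    printf("checksum: %g\n", checksum);   // keeps the work observable
    return 0;
}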

John Burton
+2  A: 

In your example, gcc with the options -O3 -ffast-math compiles the division to the same code as the multiplication compiled without -ffast-math. (In a testing environment with enough volatiles around that the loop is still there.)

So if you really want to optimise those divisions away and don’t care about the consequences, that’s the way to go. Multiplication seems to be roughly 15 times faster, btw.
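For reference, a harness along these lines reproduces that comparison (the file name in the compile commands is just whatever you save it as); the volatile qualifiers are only there so gcc can neither hoist the load nor delete the loop:

volatile double x_src = 0.5;   // volatile: force a load every iteration
volatile double sink;          // volatile: force a store every iteration

void divide_loop() {
    const double y = 44100.0;
    for (int i = 0; i < 10000; ++i) {
        double x = x_src;
        sink = x / y;   // with -O3 -ffast-math, gcc may emit x * (1/44100.0) here
    }
}

// compare:  g++ -O3 -S div.cpp   vs.   g++ -O3 -ffast-math -S div.cpp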

Debilski
A: 

When processing audio, I prefer to use fixed-point math instead. I suppose this depends on the level of precision you need. But let's assume that 16.16 fixed-point integers will do (meaning the high 16 bits are the whole number, the low 16 the fraction). Now all calculations can be done as simple integer math:

unsigned int y = 44100u << 16;   // 'u' suffix avoids signed overflow in the shift
unsigned int z = x / (y >> 16);  // divisor must be the whole number portion

Or with macros to help:

#define FP_INT(x)    ((unsigned int)(x) << 16)
#define FP_MUL(x, y) ((x) * ((y) >> 16))
#define FP_DIV(x, y) ((x) / ((y) >> 16))

unsigned int y = FP_INT(44100);
unsigned int z = FP_DIV(x, y);   // same x / y operation as above
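If the divisor itself can carry a fractional part, a 64-bit intermediate keeps the full 16.16 precision instead of truncating y to its whole-number portion (a sketch with fixed-width types; the helper names fp_mul/fp_div are made up):

#include <cstdint>

// full-precision 16.16 multiply and divide via a 64-bit intermediate
static inline uint32_t fp_mul(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) >> 16);
}

static inline uint32_t fp_div(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a << 16) / b);
}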
spoulson
A: 

Here's the problem with doing it with a reciprocal: you still have to do a division to compute 1/Y before you can actually "divide" by Y. If you're only ever dividing by that same Y, then I suppose this may be useful; otherwise it's not very practical, since the reciprocal itself is computed in binary with much the same division algorithm.
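In other words, the up-front divide only pays for itself when the same divisor is reused many times. A sketch of the two situations (function names made up):

#include <cstddef>

// pays off: one divide, then many multiplies by the same reciprocal
void scale_all(const double* in, double* out, std::size_t n, double y) {
    const double inv = 1.0 / y;          // the one unavoidable division
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * inv;
}

// does not pay off: the divisor changes every iteration, so a divide remains
void divide_elementwise(const double* in, const double* divs, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] / divs[i];
}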

joe