Are the old tricks (lookup table, approx functions) for creating faster implementations of sqrt() still useful, or is the default implementation as fast as it is going to get with modern compilers and hardware?
Sqrt is basically unchanged on most systems. It's a relatively slow operation, but the total system speeds have improved, so it may not be worth trying to use "tricks".
The decision to optimize it with approximations for the (minor) gains this can achieve are really up to you. Modern hardware has eliminated some of the need for these types of sacrifices (speed vs. precision), but in certain situations, this is still valuable.
I'd use profiling to determine whether this is "still useful".
If you have proven that the call to sqrt() in your code is a bottleneck with a profiler then it may be worth trying to create an optimizated version. Otherwise it's a waste of time.
Rule 1: profile before optimizing
Before investing any effort in the belief that you can beat the optimizer, you must profile everything and discover where the bottleneck really lies. In general, it is unlikely that sqrt()
itself is your bottleneck.
Rule 2: replace the algorithm before replacing a standard function
Even if sqrt()
is the bottleneck, then it is still reasonably likely that there are algorithmic approaches (such as sorting distances by length squared which is easily computed without a call to any math function) that can eliminate the need to call sqrt()
in the first place.
What the compiler does for you if you do nothing else
Many modern C compilers are willing to inline CRT functions at higher optimization levels, making the natural expression including calls to sqrt()
as fast as it needs to be.
In particular, I checked MinGW gcc v3.4.5 and it replaced a call to sqrt()
with inline code that shuffled the FPU state and at the core used the FSQRT
instruction. Thanks to the way that the C standard interacts with IEEE 754 floating point, it did have to follow the FSQRT
with some code to check for exceptional conditions and a call to the real sqrt()
function from the runtime library so that floating point exceptions can be handled by the library as required by the standard.
With sqrt()
inline and used in the context of a larger all double
expression, the result is as efficient as possible given the constraints of of standards compliance and preservation of full precision.
For this (very common) combination of compiler and target platform and given no knowledge of the use case, this result is pretty good, and the code is clear and maintainable.
In practice, any tricks will make the code less clear, and likely less maintainable. After all, would you rather maintain (-b + sqrt(b*b - 4.*a*c)) / (2*a)
or an opaque block of inline assembly and tables?
Also, in practice, you can generally count on the compiler and library authors to take good advantage of your platform's capabilities, and usually to know more than you do about the subtleties of optimizations.
However, on rare occasions, it is possible to do better.
One such occasion is in calculations where you know how much precision you really need and also know that you aren't depending on the the C standard's floating point exception handling and can get along with what the hardware platform supplies instead.
Edit: I rearranged the text a bit to put emphasis on profiling and algorithms as suggested by Jonathan Leffler in comments. Thanks, Jonathan.
Edit2: Fixed precedence typo in the quadratic example spotted by kmm's sharp eyes.
I find it very hard to believe that the sqrt function is your application's bottleneck because of the way modern computers are designed. Assuming this isn't a question in reference to some crazy low end processor, you take a tremendous speed hit to access memory outside of your CPU caches, so unless you're algorithm is doing math on a very few numbers (enough that they all basically fit within the L1 and L2 caches) you're not going to notice any speed up from optimizing any of your arithmetic.
It is generally safe to assume that the standard library developers are quite clever, and have written performant code. You're unlikely to be able to match them in general.
So the question becomes, do you know something that'll let you do a better job? I'm not asking about special algorithms for computing the square root (the standard library developers knows of these too, and if they were worthwhile in general, they'd have used them already), but do you have any specific information about your use case, that changes the situation?
Do you only need limited precision? If so, you can speed it up compared to the standard library version, which has to be accurate.
Or do you know that your application will always run on a specific type of CPU? Then you can look at how efficient that CPU's sqrt instruction is, and see if there are better alternatives. Of course, the downside to this is that if I run your app on another CPU, your code might turn out slower than the standard sqrt().
Can you make assumptions in your code, that the standard library developers couldn't?
You're unlikely to be able to come up with a better solution to the problem "implement an efficient replacement for the standard library sqrt".
But you might be able to come up with a solution to the problem "implement an efficient square root function for this specific situation".
This probably is the fastest method of computing the square root:
float fastsqrt(float val) {
union
{
int tmp;
float val;
} u;
u.val = val;
u.tmp -= 1<<23; /* Remove last bit so 1.0 gives 1.0 */
/* tmp is now an approximation to logbase2(val) */
u.tmp >>= 1; /* divide by 2 */
u.tmp += 1<<29; /* add 64 to exponent: (e+127)/2 =(e/2)+63, */
/* that represents (e/2)-64 but we want e/2 */
return u.val;
}
This probably is the fastest method of computing the inverse square root. Assume at most 0.00175228 error.
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1);
x = *(float*)&i;
return x*(1.5f - xhalf*x*x);
}
This is (very roughly) about 4 times faster than (float)(1.0/sqrt(x))