views: 284
answers: 4

In an app I'm profiling, I found that in some scenarios this function is able to take over 10% of total execution time.

I've seen discussion over the years of faster sqrt implementations using sneaky floating-point trickery, but I don't know if such things are outdated on modern CPUs.

The MSVC++ 2008 compiler is being used, for reference... though I'd assume the choice of compiler isn't going to add much overhead to sqrt.

See also here for similar discussion on modf function.

EDIT: for reference, this is one widely-used method, but is it actually much quicker? How many cycles is SQRT anyway these days?

+9  A: 

You're very likely to gain more speed improvements by changing your algorithms, than by changing their implementations: Try to call sqrt() less instead of making calls faster. (And if you think this isn't possible - the improvements for sqrt() you mention are just that: improvements of the algorithm used to calculate a square root.)

Since it is used very often, it is likely that your standard library's implementation of sqrt() is nearly optimal for the general case. Unless you have a restricted domain (e.g., if you need less precision) where the algorithm can take some shortcuts, it's very unlikely someone comes up with an implementation that's faster.

Note that, since that function uses 10% of your execution time, even if you manage to come up with an implementation that only takes 75% of the time of std::sqrt(), this still will only bring your total execution time down by 2.5%. For most applications users wouldn't even notice this, unless they timed it with a stopwatch.
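To illustrate the "call sqrt() less" point: a nearest-neighbour search can compare squared distances, since squaring is monotonic for non-negative values. A minimal sketch (the `Point` struct and `distSq` name are my own, just for illustration):

```cpp
#include <cmath>

struct Point { double x, y; };

// Squared Euclidean distance. It is monotonic in the true distance for
// non-negative inputs, so "which point is nearer?" needs no sqrt at all.
double distSq(Point a, Point b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Only call std::sqrt once, at the end, on the single winning value:
// double d = std::sqrt(distSq(origin, nearest));
```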

sbi
An example of this might be: if you want to find the nearest thing to you, you can compare the distances squared instead of taking the sqrt of each, since distance*distance > distance. Or you might be able to run a pre-process step which calculates the distance pairs of everything in advance. (Obviously I'm imagining some sort of 2-D or 3-D data structure).
stusmith
We're a bit beyond such trivialities in this case, I think; the actual values are genuinely needed rather than only being used in comparisons.
John
+1 For the realization that a big improvement to a piece of code used less frequently ends up with almost a negligible improvement in the *Big Picture*.
Thomas Matthews
+8  A: 

Yes, it is possible even without trickery:

1) sacrifice accuracy for speed: the sqrt algorithm is iterative, re-implement with fewer iterations.

2) lookup tables: either just for the start point of the iteration, or combined with interpolation to get you all the way there.

3) caching: are you always sqrting the same limited set of values? if so, caching can work well. I've found this useful in graphics applications where the same thing is being calculated for lots of shapes the same size, so results can be usefully cached.
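As a sketch of option 1, here is a Newton-Raphson square root with a fixed, small iteration count, trading accuracy for speed. The seed constants (0.59 + 0.41·m) are just a rough linear fit I'm assuming for illustration; frexp/ldexp halve the exponent to get the starting guess into the right range:

```cpp
#include <cmath>

// Approximate sqrt: a fixed two Newton-Raphson steps instead of
// iterating to full precision. Accuracy is sacrificed for speed.
float fastSqrt(float x) {
    int e;
    float m = std::frexp(x, &e);        // x = m * 2^e, with m in [0.5, 1)
    // Rough linear seed for sqrt(m), scaled by 2^(e/2).
    // (The 0.59/0.41 constants are an illustrative fit, not tuned.)
    float guess = std::ldexp(0.59f + 0.41f * m, e / 2);
    for (int i = 0; i < 2; ++i)          // just two refinement steps
        guess = 0.5f * (guess + x / guess);
    return guess;
}
```

Whether this actually beats the hardware instruction depends entirely on the target CPU; profile before committing to it.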

Autopulated
I always find it hard to believe that manually doing even a small number of iterations could be faster than a built-in SQRT instruction... but then I guess SQRT isn't magic, it still does iterations inside.
John
Do you have any kind of metrics on how much improvement you have seen?
John
Mileage varies with usage :) You really have to profile your own usage scenario to see what works. Regarding the fsqrt instruction, you might still be able to use it, but speed things up a tiny bit by not checking for error conditions; it depends on exactly what assembler your compiler is generating, though.
Autopulated
+1  A: 

How accurate do you need your sqrt to be? You can get reasonable approximations very quickly: see Quake3's excellent inverse square root function for inspiration (note that the code is GPL'ed, so you may not want to integrate it directly).

jemfinch
How fast is it though? If 1/sqrt is no good and you need sqrt, is the additional division still faster than the normal version?
John
Try it and see.
jemfinch
I figured someone might already have done that before recommending it...
John
@John: They have. That's why they created that function, after all. But that doesn't mean it would help (much) in your case.
sbi
@John: I can't test on your system, and system variation with the sort of floating point munging done in the referenced function is too significant a variable to ignore.
jemfinch
The so-called "fast inverse square root" is *not* "fast" on modern hardware. On nearly any processor designed in the last 10 years, there is a faster alternative. On many, the hardware square root instruction will be faster. Many have an even faster hardware inverse square root estimate (`rsqrtss` on SSE, `rsqrte` on ARMv7, etc).
Stephen Canon
@jemfinch... it's VS2008, running on typical PCs. Not just my PC.
John
+3  A: 

There's a great comparison table here: http://assemblyrequired.crashworks.org/2009/10/16/timing-square-root/

Long story short, SSE's `sqrtss` is about 2x faster than the x87 `fsqrt`, and an approximation plus one Newton iteration is about 4x faster than that (8x overall).

Also, if you're trying to take a single-precision sqrt, make sure that's actually what you're getting. I've heard of at least one compiler that would convert the float argument to a double, call double-precision sqrt, then convert back to float.

celion