sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal of the square root, accurate to about 11 bits.
sqrtss generates a far more accurate result, for when accuracy is required. rsqrtss exists for the cases where an approximation suffices but speed is required. If you read Intel's documentation, you will also find an instruction sequence (a reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly) and is still somewhat faster than sqrtss.
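To make the estimate-plus-refinement idea concrete, here is a minimal sketch using the SSE intrinsics from xmmintrin.h. It is not Intel's exact sequence; the helper names rsqrt_nr and sqrt_via_rsqrt are my own, and it assumes an x86 target with SSE and a positive, finite input.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Approximate 1/sqrt(x): rsqrtss estimate refined by one Newton-Raphson step.
   Hypothetical helper, not copied from Intel's documentation.
   Assumes x is positive and finite. */
static float rsqrt_nr(float x)
{
    __m128 vx = _mm_set_ss(x);
    __m128 y0 = _mm_rsqrt_ss(vx);                    /* rough estimate, about 11 bits */

    /* Newton-Raphson step for f(y) = 1/y^2 - x:
       y1 = 0.5 * y0 * (3 - x * y0 * y0) */
    __m128 xyy    = _mm_mul_ss(_mm_mul_ss(vx, y0), y0);
    __m128 half_y = _mm_mul_ss(_mm_set_ss(0.5f), y0);
    __m128 y1     = _mm_mul_ss(half_y, _mm_sub_ss(_mm_set_ss(3.0f), xyy));

    return _mm_cvtss_f32(y1);
}

/* sqrt(x) recovered as x * (1/sqrt(x)).
   Note: x == 0 yields NaN here (0 * infinity), so handle zero separately
   if it can occur. */
static float sqrt_via_rsqrt(float x)
{
    return x * rsqrt_nr(x);
}
```

Unlike sqrtss, the result is not guaranteed to be correctly rounded, so measure both speed and error on your own hardware before swapping it in.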
Edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
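As a rough illustration of the vectorized form, the sketch below walks an array four floats at a time with the packed intrinsics. The function names sqrt_array and rsqrt_array are made up for this example; it assumes an x86 target with SSE, uses unaligned loads so the buffers need no special alignment, and falls back to the scalar instructions for the leftover elements.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* out[i] = sqrt(in[i]), correctly rounded, four floats per sqrtps.
   Hypothetical helper for illustration. */
static void sqrt_array(const float *in, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);             /* unaligned load of 4 floats */
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));
    }
    for (; i < n; ++i) {                             /* scalar tail, 0 to 3 elements */
        out[i] = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(in[i])));
    }
}

/* out[i] ~ 1/sqrt(in[i]), about 11 bits of accuracy, four floats per rsqrtps.
   Hypothetical helper for illustration. */
static void rsqrt_array(const float *in, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_rsqrt_ps(v));
    }
    for (; i < n; ++i) {                             /* scalar tail, 0 to 3 elements */
        out[i] = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(in[i])));
    }
}
```

If the buffers are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps can replace the unaligned variants.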