I am looking for a fast, integer only algorithm to find the square root (integer part thereof) of an unsigned integer. The code must have excellent performance on ARM Thumb 2 processors. It could be assembly language or C code.
Any hints welcome.
I am looking for a fast, integer only algorithm to find the square root (integer part thereof) of an unsigned integer. The code must have excellent performance on ARM Thumb 2 processors. It could be assembly language or C code.
Any hints welcome.
One common approach is bisection.
hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
if( mid2 < number ) {
lo = mid
else
hi = mid
mid = ( hi + lo ) / 2
mid2 = mid*mid
Something like that should work reasonably well. It makes log2(number) tests, doing
log2(number) multiplies and divides. Since the divide is a divide by 2, you can replace it with a >>
.
The terminating condition may not be spot on, so be sure to test a variety of integers to be sure that the division by 2 doesn't incorrectly oscillate between two even values; they would differ by more than 1.
If exact accuracy isn't required, I have a fast approximation for you, that uses 260bytes of ram (you could halve that, but don't).
int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};
int fisqrt(int val)
{
int cnt=0;
int t=val;
while (t) {cnt++;t>>=1;}
if (6>=cnt) t=(val<<(6-cnt));
else t=(val>>(cnt-6));
return (ftbl[cnt]*ftbl2[t&31])>>15;
}
Here's the code to generate the tables:
ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");
for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");
Over the range 1->2^20, the maximum error is 11, and over the range 1->2^30, it's about 256. You could use larger tables and minimise this. It's worth mentioning that the error will always be negative - i.e. when it's wrong, the value will be LESS than the correct value.
You might do well to follow this with a refining stage.
The idea is simple enough: (ab)^0.5 = a^0.b * b^0.5.
So, we take the input X = A*B where A=2^N and 1<=B<2 Then we have a lookuptable for sqrt(2^N), and a lookuptable for sqrt(1<=B<2). We store the lookuptable for sqrt(2^N) as integer, which might be a mistake (testing shows no ill effects), and we store the lookuptable for sqrt(1<=B<2) at 15bits of fixed-point.
We know that 1<=sqrt(2^N)<65536, so that's 16bit, and we know that we can really only multiply 16bitx15bit on an ARM, without fear of reprisal, so that's what we do.
In terms of implementation, the while(t) {cnt++;t>>=1;} is effectively a count-leading-bits instruction (CLB), so if your version of the chipset has that, you're winning! Also, the shift instruction would be easy to implement with a bidirectional shifter, if you have one? There's a Lg[N] algorithm for counting the highest set bit here.
In terms of magic numbers, for changing table sizes, THE magic number for ftbl2 is 32, though note that 6 (Lg[32]+1) is used for the shifting.
I have settled to something similar to the binay digit-by-digit algorithm described in this Wikipedia article.
Integer Square Roots by Jack W. Crenshaw could be useful as another reference.
The C Snippets Archive also has an integer square root implementation. That one goes beyond just the integer result, and calculates extra fractional (fixed-point) bits of the answer.
I settled on the following code. It's essentially from the Wikipedia article on square-root computing methods. But it has been changed to use stdint.h
types uint32_t
etc. Strictly speaking the return type could be changed to uint16_t
.
/**
* \brief Fast Square root algorithm
*
* Fractional parts of the answer are discarded. That is:
* - SquareRoot(3) --> 1
* - SquareRoot(4) --> 2
* - SquareRoot(5) --> 2
* - SquareRoot(8) --> 2
* - SquareRoot(9) --> 3
*
* \param[in] a_nInput - unsigned integer for which to find the square root
*
* \return Integer square root of the input value.
*/
uint32_t SquareRoot(uint32_t a_nInput)
{
uint32_t op = a_nInput;
uint32_t res = 0;
uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
return res;
}
The nice thing, I discovered, is that a fairly simple modification can return the "rounded" answer. I found this useful in a certain application for greater accuracy. Note that in this case, the return type must be uint32_t
because the rounded square root of 232 - 1 is 216.
/**
* \brief Fast Square root algorithm, with rounding
*
* This does arithmetic rounding of the result. That is, if the real answer
* would have a fractional part of 0.5 or greater, the result is rounded up to
* the next integer.
* - SquareRootRounded(2) --> 1
* - SquareRootRounded(3) --> 2
* - SquareRootRounded(4) --> 2
* - SquareRootRounded(6) --> 2
* - SquareRootRounded(7) --> 3
* - SquareRootRounded(8) --> 3
* - SquareRootRounded(9) --> 3
*
* \param[in] a_nInput - unsigned integer for which to find the square root
*
* \return Integer square root of the input value.
*/
uint32_t SquareRootRounded(uint32_t a_nInput)
{
uint32_t op = a_nInput;
uint32_t res = 0;
uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
/* Do arithmetic rounding to nearest integer */
if (op > res)
{
res++;
}
return res;
}
Here are a whole bunch of them. I remember looking at them when I was thinking of doing Nintendo DS programming.