views: 1750
answers: 4

I'm looking for an extremely fast atof() implementation on IA32, optimized for the US-en locale, ASCII, and non-scientific notation. The Windows multithreaded CRT falls down miserably here, as it checks for locale changes on every call to isdigit(). Our current best is derived from the best of Perl's and Tcl's atof implementations, and outperforms msvcrt.dll's atof by an order of magnitude. I want to do better, but am out of ideas. The BCD-related x86 instructions seemed promising, but I couldn't get them to outperform the Perl/Tcl C code. Can any SO'ers dig up a link to the best out there? Non-x86 assembly based solutions are also welcome...

Clarifications based upon initial answers:

Inaccuracies of ~2 ulp are fine for this application. The numbers to be converted arrive in ASCII messages over the network in small batches, and our application needs to convert them with the lowest latency possible.

Dave

A: 

I remember we had a WinForms application that performed so slowly while parsing some data interchange files that we all assumed the db server was thrashing, but our smart boss found out the bottleneck was actually the call converting the parsed strings into decimals!

The simplest approach is to loop over each digit (character) in the string, keeping a running total: multiply the total by 10, then add the value of the next digit. Keep doing this until you reach the end of the string or you encounter a dot. If you encounter a dot, separate the whole-number part from the fractional part, then use a multiplier that divides itself by 10 for each digit, adding the products to the running total as you go.

Example: 123.456

running total = 0, add 1 (now it's 1)
running total = 1 * 10 = 10, add 2 (now it's 12)
running total = 12 * 10 = 120, add 3 (now it's 123)
encountered a dot, prepare for fractional part
multiplier = 0.1, multiply by 4, get 0.4, add to running total, makes 123.4
multiplier = 0.1 / 10 = 0.01, multiply by 5, get 0.05, add to running total, makes 123.45
multiplier = 0.01 / 10 = 0.001, multiply by 6, get 0.006, add to running total, makes 123.456

Of course, testing for a number's correctness as well as negative numbers will make it more complicated. But if you can "assume" that the input is correct, you can make the code much simpler and faster.
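The loop above can be sketched in a few lines of C++ (the function name `simple_atof` is mine; it assumes well-formed, non-negative ASCII input with no sign or exponent):

```cpp
// Digit-loop conversion as described above: a running total for the whole
// part, then a multiplier that shrinks by 10 for each fractional digit.
// Assumes well-formed ASCII input; no sign, exponent, or error handling.
double simple_atof(const char* s)
{
    double total = 0.0;
    while (*s >= '0' && *s <= '9')
        total = total * 10.0 + (*s++ - '0');

    if (*s == '.')
    {
        ++s;
        double multiplier = 0.1;
        while (*s >= '0' && *s <= '9')
        {
            total += (*s++ - '0') * multiplier;
            multiplier /= 10.0;
        }
    }
    return total;
}
```

Note that because 0.1 is not exactly representable in binary, the repeated multiplications accumulate a small rounding error in the fractional part.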

cruizer
I think you should handle the decimal part only once. Collect everything up to the dot; when you reach the dot, start a second number and a divisor initialized to 1, and multiply both by 10 for each digit. At the end you have three values: the integer part, the decimal part as a whole number, and the amount you need to divide the decimal part by to make it a fraction. Now divide and add: intpart + decpart/divisor.
jmucchiello
+4  A: 

What is your accuracy requirement? If you truly need it "correct" (always getting the nearest floating-point value to the decimal specified), it will probably be hard to beat the standard library versions (other than removing locale support, which you've already done), since this requires doing arbitrary-precision arithmetic. If you're willing to tolerate an ulp or two of error (and more than that for subnormals), the sort of approach cruizer proposes can work and may be faster, but it definitely will not produce <0.5ulp output. You will do better accuracy-wise to compute the integer and fractional parts separately, and compute the fraction at the end (e.g. for 12345.6789, compute it as 12345 + 6789 / 10000.0, rather than 6*.1 + 7*.01 + 8*.001 + 9*0.0001), since 0.1 is a non-terminating binary fraction and error accumulates rapidly as you compute 0.1^n. This also lets you do most of the math with integers instead of floats.
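A sketch of that integer/fraction split (the name `atof_intfrac` and the 64-bit-integer assumption are mine; sign, exponent, and overflow handling are omitted):

```cpp
#include <cstdint>

// Accumulate both the integer and fractional digits as integers, then do
// a single floating-point divide and add at the end. Assumes the digit
// runs fit in 64 bits; no sign, exponent, or overflow handling.
double atof_intfrac(const char* s)
{
    std::uint64_t intpart = 0;
    while (*s >= '0' && *s <= '9')
        intpart = intpart * 10 + (*s++ - '0');

    std::uint64_t fracpart = 0;
    double divisor = 1.0;
    if (*s == '.')
    {
        ++s;
        while (*s >= '0' && *s <= '9')
        {
            fracpart = fracpart * 10 + (*s++ - '0');
            divisor *= 10.0;
        }
    }
    return static_cast<double>(intpart)
         + static_cast<double>(fracpart) / divisor;
}
```

All per-digit arithmetic stays exact in integers; rounding only enters in the final divide-and-add.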

The BCD instructions haven't been implemented in hardware since (IIRC) the 286, and are simply microcoded nowadays. They are unlikely to be particularly high-performance.

puetzk
The approach you suggest is in fact what we're doing with the ideas we grabbed from Perl/Tcl. As you mention, it's faster and more precise. Thanks for the tidbit about the BCD history - I hadn't heard that, but the cycle counts did seem super high.
Dave Moore
A: 

Have you considered looking into having the GPU do this work? If you can load the strings into GPU memory and have it process them all you may find a good algorithm that will run significantly faster than your processor.

Alternatively, do it in an FPGA: there are FPGA PCI-E boards you can use to build arbitrary coprocessors. Use DMA to point the FPGA at the part of memory containing the array of strings you want to convert, and let it whizz through them, leaving the converted values behind.

Have you looked at a quad core processor? The real bottleneck in most of these cases is memory access anyway...

Adam Davis
These floats arrive via the network frequently, but at random times. Our emphasis is on latency, not throughput, otherwise I think your GPU or FPGA suggestions would be solid. We use 8- and 16-core CPUs, and memory/cache behavior matters with loop unrolling and conversions like those mentioned above.
Dave Moore
A: 

Hi!

I've implemented something you may find useful. In comparison with atof it's about 5x faster, and if used with __forceinline about 10x faster. Another nice thing is that it seems to have exactly the same arithmetic as the CRT implementation. Of course it has some cons too:

- it supports only single-precision float,
- it doesn't scan any special values like #INF, etc.

If anyone finds it useful, please drop me an e-mail :) Also, please keep the copyright notice in the source code; any comments/bugfixes are welcome!

Cheers, Marcin

http://3dweb.pl [email protected]

#include <cmath> // needed for pow, frexp, floor below

__forceinline bool float_scan(const wchar_t* wcs, float* val)
// (C)2009 Marcin Sokalski [email protected] - All rights reserved.
{
int hdr=0;
while (wcs[hdr]==L' ')
 hdr++;

int cur=hdr;

bool negative=false;
bool has_sign=false;

if (wcs[cur]==L'+' || wcs[cur]==L'-')
{
 if (wcs[cur]==L'-')
  negative=true;
 has_sign=true;
 cur++;
}
else
 has_sign=false;

int quot_digs=0;
int frac_digs=0;

bool full=false;

wchar_t period=0;
int binexp=0;
int decexp=0;
unsigned long value=0;

while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
{
 if (!full)
 {
  if ((value>=0x19999999 && wcs[cur]-L'0'>5) || value>0x19999999)
  {
   full=true;
   decexp++;
  }
  else
   value=value*10+wcs[cur]-L'0';
 }
 else
  decexp++;

 quot_digs++;
 cur++;
}

if (wcs[cur]==L'.' || wcs[cur]==L',')
{
 period=wcs[cur];
 cur++;

 while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
 {
  if (!full)
  {
    if ((value>=0x19999999 && wcs[cur]-L'0'>5) || value>0x19999999)
    full=true;
   else
   {
    decexp--;
    value=value*10+wcs[cur]-L'0';
   }
  }

  frac_digs++;
  cur++;
 }
}

if (!quot_digs && !frac_digs)
 return false;

wchar_t exp_char=0;

int decexp2=0; // explicit exponent
bool exp_negative=false;
bool has_expsign=false;
int exp_digs=0;

// even if value is 0, we still need to eat exponent chars
if (wcs[cur]==L'e' || wcs[cur]==L'E')
{
 exp_char=wcs[cur];
 cur++;

 if (wcs[cur]==L'+' || wcs[cur]==L'-')
 {
  has_expsign=true;
   if (wcs[cur]==L'-')
   exp_negative=true;
  cur++;
 }

 while (wcs[cur]>=L'0' && wcs[cur]<=L'9')
 {
  if (decexp2>=0x19999999)
   return false;
  decexp2=10*decexp2+wcs[cur]-L'0';
  exp_digs++;
  cur++;
 }

 if (exp_negative)
  decexp-=decexp2;
 else
  decexp+=decexp2;
}

// end of wcs scan, cur contains value's tail

if (value)
{
 while (value<=0x19999999)
 {
  decexp--;
  value=value*10;
 }

 if (decexp)
 {
  // ensure 1bit space for mul by something lower than 2.0
  if (value&0x80000000)
  {
   value>>=1;
   binexp++;
  }

  if (decexp>308 || decexp<-307)
   return false;

  // convert exp from 10 to 2 (using FPU)
  int E;
  double v=pow(10.0,decexp);
  double m=frexp(v,&E);
  m=2.0*m;
  E--;
  value=(unsigned long)floor(value*m);

  binexp+=E;
 }

 binexp+=23; // rebase exponent to 23 bits of mantissa


 // so the value is: +/- VALUE * pow(2,BINEXP);
 // (normalize mantissa to 24 bits, update exponent)
 while (value&0xFE000000)
 {
  value>>=1;
  binexp++;
 }
 if (value&0x01000000)
 {
  if (value&1)
   value++;
  value>>=1;
  binexp++;
  if (value&0x01000000)
  {
   value>>=1;
   binexp++;
  }
 }

 while (!(value&0x00800000))
 {
  value<<=1;
  binexp--;
 }

 if (binexp<-127)
 {
  // underflow
  value=0;
  binexp=-127;
 }
 else
 if (binexp>128)
  return false;

 //exclude "implicit 1"
 value&=0x007FFFFF;

 // encode exponent
 unsigned long exponent=(binexp+127)<<23;
 value |= exponent;
}

// encode sign
unsigned long sign=(unsigned long)negative<<31;
value |= sign;

if (val)
{
 *(unsigned long*)val=value;
}

return true;
}
zonk
This code looks like it goes to a lot of trouble to be exact, then fails: `pow(10.0,decexp)` is inexact for non-trivially-small values of `decexp`. And it seems like you're flushing subnormal results to zero rather than computing their values. Please correct me if I'm mistaken.
R..