views:

305

answers:

10

I got a char array, a huge array char p[n] read from a txt like.

//1.txt
194.919 -241.808 234.896
195.569 -246.179 234.482
194.919 -241.808 234.896
...

foo(char *p, float x, float y, float z) {

}

I tried to use atof, strtod, but they are real time consuming when the array is too huge, because they will call the strlen(). and the sscanf is also very slow....

I debug into the code and find that both atof() and strtod call the strlen() in the visual studio, we can check the crt code.

strtod() call:
        answer = _fltin2( &answerstruct, ptr, (int)strlen(ptr), 0, 0, _loc_update.GetLocaleT());


atof() call:
        return( *(double *)&(_fltin2( &fltstruct, nptr, (int)strlen(nptr), 0, 0, _loc_update.GetLocaleT())->dval) );

I also try to use strtok, but we should not change any data in the 1.txt.

so any one have the best way to convert all these to float x, y, z.

Visual studio 2008 + WIN7

A: 

As long as you are not using a particularly bad standard library (impossible these times, they are all good) it's not possible to do it faster than atof.

Andreas Bonini
if i want to use the atof(), i must impore the process of strlen(), that means i must fill the char array with lot of '\0', but we should not change the data.
huh.. What? O.o
Andreas Bonini
As the comments to the question now reveal: turns out MSVC has an "impossible" bad standard library. Interpreting his comment here: r9r9r9 gets crummy performance unless he can nul-terminate a chunk of his array at a time. Which he can't do, because it's not modifiable.
Steve Jessop
A: 

Use strtod. It almost certainly does not call strlen. Why would it need to know the length of the input? It merely runs past leading whitespace, then consumes as many characters as possible that make sense for a floating point literal, and then returns a pointer just past that. You can see an example implementation Perhaps you're using it non-optimally? Here's a sample of how to use strtod:

#include <stdio.h>
#include <stdlib.h>
int main() {
    char *p = "1.txt 194.919 -241.808 234.896 195.569 -246.179 234.482 194.919 -241.808 234.896";
    char *end = p;
    char *q;
    double d;
    while(*end++ != ' '); // move past "1.txt"
    do {
        q = end; 
        d = strtod(q, &end);
        printf("%g\n", d);
    } while(*end != '\0');
}

This outputs:

194.919
-241.808
234.896
195.569
-246.179
234.482
194.919
-241.808
234.896

on my machine.

Jason
Of course I read his question. I think he is wrong about `strtod` calling `strlen` and believe that he might be using it incorrectly. Therefore I gave him a sample implementation as well as commenting about `strtod` not using `strlen`.
Jason
may be i was lost, maybe the strtod() didn't call the strlen(), but it is still too slow for methe atof call the strlen()...
First, the above is about as fast as you're going to get. Second, I doubt that `atof` calls `strlen`.
Jason
thanks for the implementation code, i agree with you ,atof/strtod should not call the strlen, but they really did that, i debug the code and find that.
@Jason: r9r9r9 is right. strtod does call strlen in the VC2008 runtime library
Tarydon
@Tarydon: Wow, that is just awful (and unnecessary).
Jason
@r9r9r9: Okay, if you're sure that they use `strlen`, then I would suggest just writing your own `strtod`. Start with the implementation that I linked to already.
Jason
A: 

I don't see any reason why strod() should call strlen(). Of course it might, but nothing in its specification requires it and I'd be suprised if it did. And I'd say that strtod() about as fast as you'll get, short of writing some FPU processor-specific stuff yourself.

anon
yes, i don't think we should call the strlen(), so i think there is better way than the CRT.
@r9r9r9: Sorry, what?
Jason
i means, i don't think the atof/strtod should call the strlen, but they really did, i debug the code the find that. so i think there is better way.
Which compiler?
Jason
visual studio 2008
A: 

Why do you think atof, strtod use strlen? I've never implemented them, but I can't imagine why they'd need to know the length of the input string. It would be of no value to them. I'd use strtod as per Jason's answer. That's what it's for.

And yes, if you have a very large amount of text, it's going to take some time to convert. That's just the way it is.

Michael Kohne
A: 

As others have said, I don't think you're going to do much better than the standard library calls. They have been around for a long time and are quite highly optimized (well, they should be, at least in good implementations).

That said, there are some things that aren't clear to me. Are you reading the whole file into memory and then converting the array to another array? If so, you might want to check that the system you are running on has enough memory to do that with swapping. If you are doing this, would it be possible to just convert one line at a time as you read them off disk instead of storing them?

You could consider multithreading your program. One thread to read and buffer lines off disk, and n threads to process the lines. Dr. Dobb's Journal published a great single-reader/single-writer lockless queue implementation you could use. I've used this in a similar app. My worker threads each have an input queue, and then reader thread reads data off disk and places them into these queues in round robin style.

jbl
yes, i used the file mapping(CreateFileMapping) to get all the data into the memory, and there is enough memory.yes, multithreading is a way, but each thread still to handle this.
It's wrong to assume you can't do better than standard library calls in this case. The library calls are small and general-purpose. If speed matters more than anything else, you can do dramatically better.
dmazzoni
@dmazzoni - Fair enough. There wasn't much info in the OP, so I assumed the general case in which the input is basically unknown.
jbl
EDIT: Oops, I see this has pretty much already been suggested. @r9r9r9 It might be worth testing a quick implementation that reads the file in line by line into a buffer with, say, fgets(), which will null terminate input for you. This would keep the data on disk unmodified.
jbl
@jbl: yes, i have tried thousand way, and find out that, fscanf is better that fget+sscanf, and use the strtok() is better that fscanf.
A: 

Check out this code.

It can be further optimized if there's no need to support scientific representation, '+' sign, or leading tabs.

It doesn't use strlen, or any other standard library string routine.

// convert floating-point value in string represention to it's numerical value
// return false if NaN
// F is float/double
// T is char or wchar_t
// '1234.567' -> 1234.567
template <class F, class T> inline bool StrToDouble(const T* pczSrc, F& f)
{
    f= 0;

    if (!pczSrc)
        return false;

    while ((32 == *pczSrc) || (9 == *pczSrc))
        pczSrc++;

    bool bNegative= (_T('-') == *pczSrc);

    if ( (_T('-') == *pczSrc) || (_T('+') == *pczSrc) )
        pczSrc++;

    if ( (*pczSrc < _T('0')) || (*pczSrc > _T('9')) )
        return false;

    // todo: return false if number of digits is too large

    while ( (*pczSrc >= _T('0')) && (*pczSrc<=_T('9')) )
    {
        f= f*10. + (*pczSrc-_T('0'));
        pczSrc++;
    }

    if (_T('.') == *pczSrc)
    {
        pczSrc++;

        double e= 0.;
        double g= 1.;

        while ( (*pczSrc >= _T('0')) && (*pczSrc<=_T('9')) )
        {
            e= e*10. + (*pczSrc-_T('0'));
            g= g*10.                    ;
            pczSrc++;
        }

        f+= e/g;
    }

    if ( (_T('e') == *pczSrc) || (_T('E') == *pczSrc) ) // exponent, such in 7.32e-2
    {
        pczSrc++;

        bool bNegativeExp= (_T('-') == *pczSrc);

        if ( (_T('-') == *pczSrc) || (_T('+') == *pczSrc) )
            pczSrc++;

        int nExp= 0;
        while ( (*pczSrc >= _T('0')) && (*pczSrc <= _T('9')) )
        {
            nExp= nExp*10 + (*pczSrc-_T('0'));
            pczSrc++;
        }

        if (bNegativeExp)
            nExp= -nExp;

        // todo: return false if exponent / number of digits of exponent is too large

        f*= pow(10., nExp);
    }

    if (bNegative)
        f= -f;

    return true;
}
Lior Kogan
The post tagged C, not C++.
Alok
thanks, i will try this.
A: 

How about something like:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static float frac[] =
{
    0.000,
    0.001,
    0.002,
    ...               // fill in
    0.997,
    0.998,
    0.999,
};

static float exp[] =
{
    1e-38,
    1e-37,
    1e-36,
    ...               // fill in
    1e+36,
    1e+37,
    1e+38,
};

float cvt(char* p)
{
    char* d = strchr(p, '.');   // Find the decimal point.
    char* e = strchr(p, 'e');   // Find the exponent.
    if (e == NULL)
        e = strchr(p, 'E');

    float num = atoi(p);
    if (num > 0) {
        num += frac[atoi(d + 1)];
    } else {
        num -= frac[atoi(d + 1)];
    }
    if (e)
        num *= exp[atoi(e)];
    return num;
}

int main()
{
    char line[100];
    while(gets(line)) {
        printf("in %s, out %g\n", line, cvt(line));
    }
}

Should be good to three significant digits.


Edit: watch out for big mantissas.
Edit again: and negative exponents. :-(

Richard Pennington
the float value is unknow, we don't konw the digits, maybe 1, 2 ,3 4 or....
Good point, you'd have to manipulate the fraction a little bit.
Richard Pennington
If you want to roll your own, it's probably better to find a free (as in speech) version of atof.c online rather than start from scratch.
Steve Jessop
+1  A: 

If you can make additional assumptions about the format of the floating point values, parsing them yourself might increase performance.

Example code for parsing ' ' or '\n'-separated values without exponents and no input validation:

float parsef(const char **str)
{
    const char *cc = *str;

    _Bool neg = (*cc == '-');
    if(neg) ++cc;

    float value = 0, e = 1;

    for(; *cc != '.'; ++cc)
    {
        if(*cc == ' ' || *cc == '\n' || !*cc)
        {
            *str = cc;
            return neg ? -value : value;
        }

        value *= 10;
        value += *cc - '0';
    }

    for(++cc;; ++cc)
    {
        if(*cc == ' ' || *cc == '\n' || !*cc)
        {
            *str = cc;
            return neg ? -value : value;
        }

        e /= 10;
        value += (*cc - '0') * e;
    }
}

Example code:

const char *str = "42 -15.4\n23.001";
do printf("%f\n", parsef(&str));
while(*str++);
Christoph
yes, you a right, the problem is we don't know the fromat of the float value...
+1  A: 

Okay, how about doing the tokenization yourself and then calling strtod.

What I'm thinking is something like this:

char *current = ...;  // initialited to the head of your character array
while (*current != '\0')
{
    char buffer[64];
    unsigned int idx = 0;

    // copy over current number
    while (*current != '\0' && !isspace(*current))
    {
        buffer[idx++] = *current++;
    }
    buffer[idx] = '\0';

    // move forward to next number
    while (*current != '\0' && isspace(*current))
    {
        current++;
    }

    // use strtod to convert buffer   
}

Some issues with this is the tokenization is very simple. It will work for the format you posted, but if the format varies (another line uses : to separate the numbers), it won't work.

Another issue is that the code assumes all numbers have < 64 characters. If they are longer, you'll get a buffer overflow.

Also, the copying to a temporary buffer will add some overhead (but hopefully less then the overhead of constantly doing a strlen on the entire buffer). I know you said you can't change the original buffer, but can you do a temporary change (i.e. the buffer can change as as long as you return it to it's original state before you return):

char *current = ...;  // initialited to the head of your character array
while (*current != '\0')
{
    char *next_sep = current;
    while (*next_sep != '\0' && !isspace(*next_sep))
    {
        next_sep++;
    }

    // save the separator before overwriting it
    char tmp = *next_sep;
    *next_sep = '\0';

    // use strtod on current

   // Restore the separator.
   *next_sep = tmp;

    current = next_sep;

    // move forward to next number
    while (*current != '\0' && isspace(*current))
    {
        current++;
    }
}

This technique means no copying and no worries about buffer overflow. You do need to temporarily modify the buffer; hopefully that is

R Samuel Klatchko
thanks, the problem is we should not change the input data, and if we copy one backup, it will still slow.if we allow to change the data, strtok() will be a good way.
Copying a small portion of the buffer for each float to convert shouldn't be too slow - as long as you only copy each character a fixed number of times, it'll overall be an O(N) operation (instead of the O(N^2) behaviour you're getting with `strtod`). Copying a few characters is a tiny overhead compared to the actual floating point conversion.
caf
A: 

I doubt if strlen is costing you much.

If you can take advantage of your numbers falling in a relatively restricted range, then what I suggest is to parse it yourself, doing as little computation as possible, such as:

#define DIGIT(c) ((c)>='0' && (c)<='9')

BOOL parseNum(char* *p0, float *f){
  char* p = *p0;
  int n = 0, frac = 1;
  BOOL bNeg = FALSE;
  while(*p == ' ') p++;
  if (*p == '-'){p++; bNeg = TRUE;}
  if (!(DIGIT(*p) || *p=='.')) return FALSE;
  while(DIGIT(*p)){
    n = n * 10 + (*p++ - '0');
  }
  if (*p == '.'){
    p++;
    while(DIGIT(*p)){
      n = n * 10 + (*p++ - '0');
      frac *= 10;
    }
  }
  *f = (float)n/(float)frac;
  if (bNeg) *f = -*f;
  *p0 = p;
  return TRUE;
}
Mike Dunlavey
`strlen` is costing him a lot, because it's being called on the entire remaining buffer for each floating point value in the buffer. This turns the whole operation into an O(N^2) deal.
caf
@caf: Yeah that seems kind of silly. A profiler or stackshot would pick that out real quick.
Mike Dunlavey
... FYI stackshots: http://stackoverflow.com/questions/406760/whats-your-most-controversial-programming-opinion/1562802#1562802
Mike Dunlavey