I have a file containing lines of the form:

double mass, string seq, int K, int TS, int M, [variable number of ints]
688.83       AFTDSK      1      1       0       3384 2399 1200
790.00       MDSSTK      1      3       1       342 2

I need a (preferably simple) way of parsing this file without Boost. If the number of values per line were constant, I would have used the solution here.

Each line will become an object of class Peptide:

class Peptide {
    public:
        double mass;
        string sequence;
        int numK;
        int numPTS;
        int numM;
        set<int> parents;
};

The first three integers map to named members of the object, while all the following integers need to be inserted into the parents set.


I was fortunate enough to get two really awesome responses, but the runtime differences made the C-style implementation the best answer for me.

+3  A: 

The best way I know of to parse an ASCII text file is to read it line by line and use strtok. It's a C function, but it will break your input into individual tokens for you. Then you can use the conversion functions strtod and strtol to parse your numeric values. For the file format you specified, I'd do something like this:

  // Inside main(); needs <iostream>, <fstream>, <string>, <set>, <cstring>, and <cstdlib>.
  string line;
  ifstream f(argv[1]);
  if(!f.is_open()) {
    cout << "The file you specified could not be read." << endl;
    return 1;
  }

  while(getline(f, line)) {
    if(line == "" || line[0] == '#') continue;

    // strtok modifies its input, so tokenize a writable copy of the line.
    char *ptr, *buf;
    buf = new char[line.size() + 1];
    strcpy(buf, line.c_str());

    Peptide pep;
    pep.mass     = strtod(strtok(buf, " "), NULL);
    pep.sequence = strtok(NULL, " ");
    pep.numK     = strtol(strtok(NULL, " "), NULL, 10);
    pep.numPTS   = strtol(strtok(NULL, " "), NULL, 10);
    pep.numM     = strtol(strtok(NULL, " "), NULL, 10);
    while((ptr = strtok(NULL, " ")) != NULL)
      pep.parents.insert(strtol(ptr, NULL, 10));
    delete[] buf;

    cout << "mass: " << pep.mass << endl
         << "sequence: " << pep.sequence << endl
         << "numK: " << pep.numK << endl
         << "numPTS: " << pep.numPTS << endl
         << "numM: " << pep.numM << endl
         << "parents:" << endl;

    set<int>::iterator it;
    for(it = pep.parents.begin(); it != pep.parents.end(); it++)
      cout << "\t- " << *it << endl;
  }
  f.close();
Benson
See this: atoi considered harmful : http://blog.mozilla.com/nnethercote/2009/03/13/atol-considered-harmful/.
Stephen
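The core of that complaint is that atoi gives no way to detect a malformed number, whereas strtol reports failure through its end-pointer argument and errno. A minimal sketch of that check (the helper name parse_int is purely illustrative):

#include <cerrno>
#include <cstdlib>

// Hypothetical helper: convert one token to an int, reporting failure
// instead of silently returning 0 the way atoi does.
bool parse_int(const char* token, int& out) {
  char* end = NULL;
  errno = 0;
  long value = strtol(token, &end, 10);
  if (end == token || *end != '\0' || errno == ERANGE)
    return false;  // no digits, trailing junk, or out of range
  out = static_cast<int>(value);
  return true;
}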
Wow, this looks perfect. Thank you.
lashleigh
@Benson : You got a +1 from me, but yeah, any of `sscanf`, `istringstream`, `strtol` would be better.
Stephen
@Stephen Good point; I've switched to strtol, as you suggested.
Benson
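For completeness, here is roughly what the sscanf route mentioned above could look like. The %n conversion records how many characters the five fixed fields consumed, so the variable number of trailing ints can be read in a loop starting at that offset (parse_line and the 64-byte sequence buffer are just illustrative; Peptide is the class from the question):

#include <cstdio>
#include <string>

bool parse_line(const std::string& line, Peptide& pep) {
  char seq[64];
  int offset = 0;
  // Read the five fixed fields; %n stores how far into the line we got.
  if (sscanf(line.c_str(), "%lf %63s %d %d %d%n",
             &pep.mass, seq, &pep.numK, &pep.numPTS, &pep.numM,
             &offset) < 5)
    return false;  // malformed fixed fields
  pep.sequence = seq;

  // Read the remaining ints one at a time, advancing by what each read consumed.
  int value = 0, consumed = 0;
  const char* p = line.c_str() + offset;
  while (sscanf(p, "%d%n", &value, &consumed) == 1) {
    pep.parents.insert(value);
    p += consumed;
  }
  return true;
}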
+10  A: 

If you want to use C++, use C++:

// Requires <fstream>, <sstream>, <list>, <string>, and <set>.
std::list<Peptide> list;
std::ifstream file("filename.ext");
std::string line;

while (std::getline(file, line)) {

    // Ignore empty lines.
    if (line.empty()) continue;

    // Stringstreams are your friends!
    std::istringstream row(line);

    // Read ordinary data members.
    Peptide peptide;
    row >> peptide.mass
        >> peptide.sequence
        >> peptide.numK
        >> peptide.numPTS
        >> peptide.numM;

    // Read numbers until reading fails.    
    int parent;
    while (row >> parent)
        peptide.parents.insert(parent);

    // Do whatever you like with each peptide.
    list.push_back(peptide);

}
Jon Purdy
Out of curiosity, how does istringstream perform?
Benson
I'm not familiar with string streams but I will certainly check them out. Thanks!
lashleigh
+1 for a C++ centric solution!
Jim Lewis
@Benson: an `istringstream` is an input stream backed by a `string` instead of, say, a file; performance varies with the application, but is more than adequate for these purposes. @lashleigh: No problem. They're very useful for parsing and formatting. @Jim: Thanks!
Jon Purdy
@Jon Purdy Well sure, but what are these purposes? The files are going to contain at least millions, maybe hundreds of millions, of lines.
lashleigh
If you find that your bottleneck is in stringstream, you can do the same thing that boost::lexical_cast does, which is to create your own stream class that uses fixed-size buffers, possibly statically allocated, to allow better optimizations and prevent extra allocations. But don't do that kind of thing until you're sure it's your bottleneck.
Michael Anderson
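One lighter-weight step in that direction is to construct a single istringstream outside the loop and rebind it to each line with clear() and str(). That avoids creating and destroying a stream object per line, though str() still copies the line, so it is only part of what Michael describes. A sketch, assuming the Peptide class from the question and the same headers as the answer above:

std::ifstream file("filename.ext");
std::string line;
std::istringstream row;            // constructed once, reused for every line

while (std::getline(file, line)) {
    if (line.empty()) continue;

    row.clear();                   // reset the eof/fail flags left by the previous line
    row.str(line);                 // rebind the buffer to the current line

    Peptide peptide;
    row >> peptide.mass
        >> peptide.sequence
        >> peptide.numK
        >> peptide.numPTS
        >> peptide.numM;

    int parent;
    while (row >> parent)
        peptide.parents.insert(parent);

    // ... use the peptide as before ...
}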
Just for fun I swapped out the logic in my test program for the stringstream-based logic here. That version of the program takes approximately twice as long to run on a 300,000 line input file.
Benson
More data! I tried a 6.6 million line input file: my version runs in 8.8 seconds, and this version runs in about 17.8 seconds. Code here: http://gist.github.com/452351 and here: http://gist.github.com/452353
Benson
Interesting benchmarks! I guess there's always a certain price to be paid for clarity and genericity.
Jon Purdy
A certain price? 100% speed increase is, in my humble opinion, quite significant. I definitely agree that your code is much, much prettier than mine, but mine's still perfectly readable. :-)
Benson
Which compiler and settings? Did you disable _SECURE_SCL (Visual C++ only)?
Fabio Fracassi
@Benson: Oh, yours is undoubtedly superior in terms of performance, and that's the important metric here. I was only offering an idiomatic C++ solution for the sake of completeness and clarity. "A certain price" was a tongue-in-cheek understatement, but then again, you can't very well call this version slow: 370,786 entries per second is quite reasonable!
Jon Purdy
@Fabio I was using g++ for both tests. I did my tests on a 4-year-old Linux desktop machine. @Jon I definitely see where you're coming from -- your solution does run fairly quickly, and is idiomatic C++. Mine is a mongrel C/C++ solution. I am actually inclined to build a pure C version and see how fast that runs... If I do, I'll let you know what I find.
Benson