I have a file containing lines of the form:

double mass, string seq, int K, int TS, int M, [variable number of ints]
688.83       AFTDSK      1      1       0       3384 2399 1200
790.00       MDSSTK      1      3       1       342 2

I need a (preferably simple) way of parsing this file without Boost. If the number of values per line were constant, I would have used the solution here.

Each line will become an object of class Peptide:

class Peptide {
    public:
        double mass;
        string sequence;
        int numK;
        int numPTS;
        int numM;
        set<int> parents;
};

The first three integers map to named members of the object, while all the following integers need to be inserted into the parents set.


I was fortunate enough to get two really awesome responses, but the runtime differences made the C-style implementation the best answer for me.

+3  A: 

The best way I know of to parse an ASCII text file is to read it line by line and use strtok. It's a C function, but it will break your input into individual tokens for you. Then you can use the conversion functions strtod and strtol to parse your numeric values. For the file format you specified, I'd do something like this:

  // Inside main(); needs <iostream>, <fstream>, <string>, <set>, <cstring>, and <cstdlib>.
  string line;
  ifstream f(argv[1]);
  if(!f.is_open()) {
    cout << "The file you specified could not be read." << endl;
    return 1;
  }

  while(getline(f, line)) {
    if(line == "" || line[0] == '#') continue;

    // strtok modifies its input, so tokenize a writable copy of the line.
    char *ptr, *buf;
    buf = new char[line.size() + 1];
    strcpy(buf, line.c_str());

    Peptide pep;
    pep.mass     = strtod(strtok(buf, " "), NULL);
    pep.sequence = strtok(NULL, " ");
    pep.numK     = strtol(strtok(NULL, " "), NULL, 10);
    pep.numPTS   = strtol(strtok(NULL, " "), NULL, 10);
    pep.numM     = strtol(strtok(NULL, " "), NULL, 10);
    while((ptr = strtok(NULL, " ")) != NULL)
      pep.parents.insert(strtol(ptr, NULL, 10));
    delete[] buf;

    cout << "mass: " << pep.mass << endl
         << "sequence: " << pep.sequence << endl
         << "numK: " << pep.numK << endl
         << "numPTS: " << pep.numPTS << endl
         << "numM: " << pep.numM << endl
         << "parents:" << endl;

    set<int>::iterator it;
    for(it = pep.parents.begin(); it != pep.parents.end(); it++)
      cout << "\t- " << *it << endl;
  }
  f.close();
Benson
See this: atoi considered harmful : http://blog.mozilla.com/nnethercote/2009/03/13/atol-considered-harmful/.
Stephen
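The core of that complaint is that atoi gives no way to detect a malformed number, whereas strtol reports failure through its end-pointer argument and errno. A minimal sketch of that check (the helper name parse_int is purely illustrative):

#include <cerrno>
#include <cstdlib>

// Hypothetical helper: convert one token to an int, reporting failure
// instead of silently returning 0 the way atoi does.
bool parse_int(const char* token, int& out) {
  char* end = NULL;
  errno = 0;
  long value = strtol(token, &end, 10);
  if (end == token || *end != '\0' || errno == ERANGE)
    return false;  // no digits, trailing junk, or out of range
  out = static_cast<int>(value);
  return true;
}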
Wow, this looks perfect. Thank you.
lashleigh
@Benson : You got a +1 from me, but yeah, any of `sscanf`, `istringstream`, `strtol` would be better.
Stephen
@Stephen Good point; I've switched to strtol, as you suggested.
Benson
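For completeness, here is roughly what the sscanf route mentioned above could look like. The %n conversion records how many characters the five fixed fields consumed, so the variable number of trailing ints can be read in a loop starting at that offset (parse_line and the 64-byte sequence buffer are just illustrative; Peptide is the class from the question):

#include <cstdio>
#include <string>

bool parse_line(const std::string& line, Peptide& pep) {
  char seq[64];
  int offset = 0;
  // Read the five fixed fields; %n stores how far into the line we got.
  if (sscanf(line.c_str(), "%lf %63s %d %d %d%n",
             &pep.mass, seq, &pep.numK, &pep.numPTS, &pep.numM,
             &offset) < 5)
    return false;  // malformed fixed fields
  pep.sequence = seq;

  // Read the remaining ints one at a time, advancing by what each read consumed.
  int value = 0, consumed = 0;
  const char* p = line.c_str() + offset;
  while (sscanf(p, "%d%n", &value, &consumed) == 1) {
    pep.parents.insert(value);
    p += consumed;
  }
  return true;
}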
+10  A: 

If you want to use C++, use C++:

// Requires <fstream>, <sstream>, <list>, <string>, and <set>.
std::list<Peptide> list;
std::ifstream file("filename.ext");
std::string line;

while (std::getline(file, line)) {

    // Ignore empty lines.
    if (line.empty()) continue;

    // Stringstreams are your friends!
    std::istringstream row(line);

    // Read ordinary data members.
    Peptide peptide;
    row >> peptide.mass
        >> peptide.sequence
        >> peptide.numK
        >> peptide.numPTS
        >> peptide.numM;

    // Read numbers until reading fails.    
    int parent;
    while (row >> parent)
        peptide.parents.insert(parent);

    // Do whatever you like with each peptide.
    list.push_back(peptide);

}
Jon Purdy
Out of curiosity, how does istringstream perform?
Benson
I'm not familiar with string streams but I will certainly check them out. Thanks!
lashleigh
+1 for a C++ centric solution!
Jim Lewis
@Benson: an `istringstream` is an input stream backed by a `string` instead of, say, a file; performance varies with the application, but is more than adequate for these purposes. @lashleigh: No problem. They're very useful for parsing and formatting. @Jim: Thanks!
Jon Purdy
@Jon Purdy Well sure, but what are these purposes? The files are going to contain at least millions, maybe hundreds of millions, of lines.
lashleigh
If you find that your bottleneck is in stringstream, you can do the same thing that boost::lexical_cast does, which is to create your own stream class that uses fixed-size buffers, possibly statically allocated, to allow better optimizations and prevent extra allocations. But don't do that kind of thing until you're sure it's your bottleneck.
Michael Anderson
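One lighter-weight step in that direction is to construct a single istringstream outside the loop and rebind it to each line with clear() and str(). That avoids creating and destroying a stream object per line, though str() still copies the line, so it is only part of what Michael describes. A sketch, assuming the Peptide class from the question and the same headers as the answer above:

std::ifstream file("filename.ext");
std::string line;
std::istringstream row;            // constructed once, reused for every line

while (std::getline(file, line)) {
    if (line.empty()) continue;

    row.clear();                   // reset the eof/fail flags left by the previous line
    row.str(line);                 // rebind the buffer to the current line

    Peptide peptide;
    row >> peptide.mass
        >> peptide.sequence
        >> peptide.numK
        >> peptide.numPTS
        >> peptide.numM;

    int parent;
    while (row >> parent)
        peptide.parents.insert(parent);

    // ... use the peptide as before ...
}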
Just for fun I swapped out the logic in my test program for the stringstream-based logic here. That version of the program takes approximately twice as long to run on a 300,000 line input file.
Benson
More data! I tried a 6.6 million line input file: my version runs in 8.8 seconds, and this version runs in about 17.8 seconds. Code here: http://gist.github.com/452351 and here: http://gist.github.com/452353
Benson
Interesting benchmarks! I guess there's always a certain price to be paid for clarity and genericity.
Jon Purdy
A certain price? 100% speed increase is, in my humble opinion, quite significant. I definitely agree that your code is much, much prettier than mine, but mine's still perfectly readable. :-)
Benson
Which compiler and settings? Did you disable _SECURE_SCL (Visual C++ only)?
Fabio Fracassi
@Benson: Oh, yours is undoubtedly superior in terms of performance, and that's the important metric here. I was only offering an idiomatic C++ solution for the sake of completeness and clarity. "A certain price" was a tongue-in-cheek understatement, but then again, you can't very well call this version slow: 370,786 entries per second is quite reasonable!
Jon Purdy
@Fabio I was using g++ for both tests. I did my tests on a 4-year-old Linux desktop machine. @Jon I definitely see where you're coming from -- your solution does run fairly quickly, and is idiomatic C++. Mine is a mongrel C/C++ solution. I am actually inclined to build a pure C version and see how fast that runs... If I do, I'll let you know what I find.
Benson