views:

79

answers:

3

I normally use the method described in csv parser to read spreadsheet files. However, when reading a 64 MB file with around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class is used to read the file row by row, and a private vector is used to store all the data in a row.

Several things to note:

  • I did reserve enough capacity for the vector, but it didn't help much.
  • I also need to create instances of some class when reading each line, but even when the code just reads in the data without creating any instances, it takes a long time.
  • The file is tab-delimited instead of comma-delimited, but I don't think it matters.

Since some columns in that file are not useful data, I changed the method to store the whole line in a private string member and then find the positions of the (n-1)th and the nth delimiter to extract the useful data (of course, there are many useful columns). By doing so, I avoid some push_back operations and cut the time to a little more than 2 minutes. However, that still seems too long to me.
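Roughly, the extraction looks like this (a simplified sketch, not the actual code; the function name and the 0-based column index are just for illustration):

#include <cstddef>
#include <string>

// Sketch: return the nth (0-based) tab-delimited field of a line by scanning
// for the delimiters, without splitting the whole line into a vector.
std::string getColumn( const std::string &line, std::size_t n, char delim = '\t' )
{
    std::string::size_type start = 0;
    for ( std::size_t i = 0; i < n; ++i ) {      // skip the first n delimiters
        start = line.find( delim, start );
        if ( start == std::string::npos )
            return std::string();                // fewer than n delimiters
        ++start;                                 // move past the delimiter
    }
    const std::string::size_type end = line.find( delim, start );
    return line.substr( start, end == std::string::npos ? std::string::npos
                                                         : end - start );
}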

Here are my questions:

  1. Is there a way to read such a spreadsheet file more efficiently?

  2. Shall I read the file by buffer instead of line by line? If so, how do I read by buffer and still use the CSVRow class?

  3. I haven't tried the Boost tokenizer; is it more efficient?

Thank you for your help!

+1  A: 

It looks like you're being bottlenecked by I/O. Instead of reading the file line by line, read it in blocks of maybe 8 MB. Parse each block for records and check whether the end of the block is a partial record. If it is, copy that portion of the last record and prepend it to the next block. Repeat until the whole file has been read. This way, for a 64 MB file you're only making 8 I/O requests. You can experiment with the block size to find the best trade-off between performance and memory usage.
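Something along these lines, as a rough sketch only (the 8 MB block size and the processRecord placeholder are assumptions, not tested code):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Placeholder: parse one record (line); in the real code this would split
// the tab-delimited fields and build whatever objects are needed.
void processRecord( const std::string &record )
{
    (void)record;
}

void readInBlocks( const char *path, std::size_t blockSize = 8 * 1024 * 1024 )
{
    std::ifstream in( path, std::ios::binary );
    std::vector<char> block( blockSize );
    std::string leftover;   // partial record carried over from the previous block

    while ( in ) {
        in.read( &block[0], static_cast<std::streamsize>( block.size() ) );
        const std::size_t got = static_cast<std::size_t>( in.gcount() );
        if ( got == 0 )
            break;

        std::string buffer;
        buffer.reserve( leftover.size() + got );
        buffer.append( leftover ).append( &block[0], got );

        // Hand every complete record in the buffer to the parser.
        std::size_t start = 0, nl;
        while ( ( nl = buffer.find( '\n', start ) ) != std::string::npos ) {
            processRecord( buffer.substr( start, nl - start ) );
            start = nl + 1;
        }
        leftover = buffer.substr( start );   // keep the trailing partial record
    }
    if ( !leftover.empty() )
        processRecord( leftover );           // file may not end with a newline
}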

Exception
A: 

If reading the whole data into memory is acceptable (and apparently it is), then I'd do this:

  1. Read the whole file into a std::vector<char> (see the sketch below)
  2. Populate a vector<vector<vector<char>::size_type> > which contains the start positions of all newline and comma characters in the data. These positions denote the start/end of each cell

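For step 1, one simple way to slurp the file into the vector (the file name is just a placeholder):

#include <fstream>
#include <iterator>
#include <vector>

std::ifstream in( "data.csv", std::ios::binary );
std::vector<char> data( (std::istreambuf_iterator<char>( in )),
                        std::istreambuf_iterator<char>() );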
A code sketch to demonstrate the indexing idea (step 2):

vector<vector<vector<char>::size_type> > rows;
for ( vector<char>::size_type i = 0; i < data.size(); ++i ) {
    vector<vector<char>::size_type> currentRow;
    currentRow.push_back( i );           // start of the first cell in this row
    while ( i < data.size() && data[i] != '\n' ) {
        if ( data[i] == ',' ) { // XXX consider comma at end of line
            currentRow.push_back( i );   // delimiter position marks a cell boundary
        }
        ++i;                             // advance within the row
    }
    rows.push_back( currentRow );
}
// XXX consider files which don't end in a newline

Thus, you know the positions of all newlines and all commas, and you have the complete CSV data available as one contiguous memory block, so you can easily extract the text of a cell like this:

// XXX error checking omitted for simplicity
string getCellText( int row, int col )
{
     // XXX Needs handling for last cell of a line
     vector<char>::size_type start = rows[row][col];
     if ( col > 0 )
         ++start;                        // skip the delimiter itself
     const vector<char>::size_type end = rows[row][col + 1];
     return string( data.begin() + start, data.begin() + end );
}
Frerich Raabe
A: 

This article should be helpful.

In short:
1. Either use memory-mapped files OR read the file in 4 KB blocks to access the data. Memory-mapped files will be faster (a sketch follows at the end of this answer).
2. Within the parsing loop, try to avoid push_back, std::string operations (like +), and similar STL routines. They are nice, but they ALL use dynamically allocated memory, and dynamic memory allocation is slow. Anything that is frequently allocated dynamically will make your program slower. Try to preallocate all buffers before parsing; counting all the tokens in order to preallocate memory for them shouldn't be difficult.
3. Use a profiler to identify what causes the slowdown.
4. You may want to avoid iostream's << and >> operators and parse the file yourself.

In general, an efficient C/C++ parser implementation should be able to parse a 20 MB text file within 3 seconds.
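For illustration, a rough sketch of the memory-mapped approach from point 1, using POSIX mmap (so not portable to Windows; the file name is a placeholder):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <iostream>

int main()
{
    const int fd = open( "data.csv", O_RDONLY );   // placeholder file name
    if ( fd < 0 ) { std::perror( "open" ); return 1; }

    struct stat sb;
    if ( fstat( fd, &sb ) != 0 ) { std::perror( "fstat" ); close( fd ); return 1; }

    // Map the whole file read-only; the OS pages it in on demand, no copying.
    void *p = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if ( p == MAP_FAILED ) { std::perror( "mmap" ); close( fd ); return 1; }
    const char *data = static_cast<const char *>( p );

    // Scan data[0 .. sb.st_size - 1] for '\t' and '\n' here; as a trivial
    // example, count the lines.
    std::size_t lines = 0;
    for ( off_t i = 0; i < sb.st_size; ++i )
        if ( data[i] == '\n' )
            ++lines;
    std::cout << lines << " lines\n";

    munmap( p, sb.st_size );
    close( fd );
    return 0;
}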

SigTerm