views:

79

answers:

3

I normally use the method described in csv parser to read spreadsheet files. However, when reading a 64 MB file with around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class is used to read the file row by row, and a private vector is used to store all the data in a row.

Several things to note:

  • I did reserve enough capacity for the vector, but it didn't help much.
  • I also need to create instances of some class when reading each line, but even when the code just reads in the data without creating any instances, it takes a long time.
  • The file is tab-delimited instead of comma-delimited, but I don't think it matters.

Since some columns in that file are not useful data, I changed the method to store the whole line in a private string member and then find the positions of the (n-1)th and the nth delimiter to extract the useful data (of course, there are many useful columns). By doing so, I avoid some push_back operations and cut the time to a little more than 2 minutes. However, that still seems too long to me.
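Roughly, the extraction looks like this (a simplified sketch, not the actual code; the function name and the 0-based column index are just for illustration):

#include <cstddef>
#include <string>

// Sketch: return the nth (0-based) tab-delimited field of a line by scanning
// for the delimiters, without splitting the whole line into a vector.
std::string getColumn( const std::string &line, std::size_t n, char delim = '\t' )
{
    std::string::size_type start = 0;
    for ( std::size_t i = 0; i < n; ++i ) {      // skip the first n delimiters
        start = line.find( delim, start );
        if ( start == std::string::npos )
            return std::string();                // fewer than n delimiters
        ++start;                                 // move past the delimiter
    }
    const std::string::size_type end = line.find( delim, start );
    return line.substr( start, end == std::string::npos ? std::string::npos
                                                         : end - start );
}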

Here are my questions:

  1. Is there a way to read such a spreadsheet file more efficiently?

  2. Shall I read the file by buffer instead of line by line? If so, how do I read by buffer and still use the CSVRow class?

  3. I haven't tried the Boost tokenizer; is it more efficient?

Thank you for your help!

+1  A: 

It looks like you're being bottlenecked by I/O. Instead of reading the file line by line, read it in blocks of maybe 8 MB. Parse each block for records and check whether the end of the block is a partial record. If it is, copy that portion of the last record and prepend it to the next block. Repeat until the whole file has been read. This way, for a 64 MB file you're only making 8 I/O requests. You can experiment with the block size to find the best trade-off between performance and memory usage.
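Something along these lines, as a rough sketch only (the 8 MB block size and the processRecord placeholder are assumptions, not tested code):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Placeholder: parse one record (line); in the real code this would split
// the tab-delimited fields and build whatever objects are needed.
void processRecord( const std::string &record )
{
    (void)record;
}

void readInBlocks( const char *path, std::size_t blockSize = 8 * 1024 * 1024 )
{
    std::ifstream in( path, std::ios::binary );
    std::vector<char> block( blockSize );
    std::string leftover;   // partial record carried over from the previous block

    while ( in ) {
        in.read( &block[0], static_cast<std::streamsize>( block.size() ) );
        const std::size_t got = static_cast<std::size_t>( in.gcount() );
        if ( got == 0 )
            break;

        std::string buffer;
        buffer.reserve( leftover.size() + got );
        buffer.append( leftover ).append( &block[0], got );

        // Hand every complete record in the buffer to the parser.
        std::size_t start = 0, nl;
        while ( ( nl = buffer.find( '\n', start ) ) != std::string::npos ) {
            processRecord( buffer.substr( start, nl - start ) );
            start = nl + 1;
        }
        leftover = buffer.substr( start );   // keep the trailing partial record
    }
    if ( !leftover.empty() )
        processRecord( leftover );           // file may not end with a newline
}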

Exception
A: 

If reading the whole data into memory is acceptable (and apparently it is), then I'd do this:

  1. Read the whole file into a std::vector<char> (see the sketch below)
  2. Populate a vector<vector<vector<char>::size_type> > which contains the start positions of all newline and comma characters in the data. These positions denote the start/end of each cell

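For step 1, one simple way to slurp the file into the vector (the file name is just a placeholder):

#include <fstream>
#include <iterator>
#include <vector>

std::ifstream in( "data.csv", std::ios::binary );
std::vector<char> data( (std::istreambuf_iterator<char>( in )),
                        std::istreambuf_iterator<char>() );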
A code sketch to demonstrate the indexing idea (step 2):

vector<vector<vector<char>::size_type> > rows;
for ( vector<char>::size_type i = 0; i < data.size(); ++i ) {
    vector<vector<char>::size_type> currentRow;
    currentRow.push_back( i );           // start of the first cell in this row
    while ( i < data.size() && data[i] != '\n' ) {
        if ( data[i] == ',' ) { // XXX consider comma at end of line
            currentRow.push_back( i );   // delimiter position marks a cell boundary
        }
        ++i;                             // advance within the row
    }
    rows.push_back( currentRow );
}
// XXX consider files which don't end in a newline

Thus, you know the positions of all newlines and all commas, and you have the complete CSV data available as one contiguous memory block, so you can easily extract the text of a cell like this:

// XXX error checking omitted for simplicity
string getCellText( int row, int col )
{
     // XXX Needs handling for last cell of a line
     vector<char>::size_type start = rows[row][col];
     if ( col > 0 )
         ++start;                        // skip the delimiter itself
     const vector<char>::size_type end = rows[row][col + 1];
     return string( data.begin() + start, data.begin() + end );
}
Frerich Raabe
A: 

This article should be helpful.

In short:
1. Either use memory-mapped files OR read the file in 4 KB blocks to access the data. Memory-mapped files will be faster (a sketch follows at the end of this answer).
2. Within the parsing loop, try to avoid push_back, std::string operations (like +), and similar STL routines. They are nice, but they ALL use dynamically allocated memory, and dynamic memory allocation is slow. Anything that is frequently allocated dynamically will make your program slower. Try to preallocate all buffers before parsing; counting all the tokens in order to preallocate memory for them shouldn't be difficult.
3. Use a profiler to identify what causes the slowdown.
4. You may want to avoid iostream's << and >> operators and parse the file yourself.

In general, an efficient C/C++ parser implementation should be able to parse a 20 MB text file within 3 seconds.
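For illustration, a rough sketch of the memory-mapped approach from point 1, using POSIX mmap (so not portable to Windows; the file name is a placeholder):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <iostream>

int main()
{
    const int fd = open( "data.csv", O_RDONLY );   // placeholder file name
    if ( fd < 0 ) { std::perror( "open" ); return 1; }

    struct stat sb;
    if ( fstat( fd, &sb ) != 0 ) { std::perror( "fstat" ); close( fd ); return 1; }

    // Map the whole file read-only; the OS pages it in on demand, no copying.
    void *p = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if ( p == MAP_FAILED ) { std::perror( "mmap" ); close( fd ); return 1; }
    const char *data = static_cast<const char *>( p );

    // Scan data[0 .. sb.st_size - 1] for '\t' and '\n' here; as a trivial
    // example, count the lines.
    std::size_t lines = 0;
    for ( off_t i = 0; i < sb.st_size; ++i )
        if ( data[i] == '\n' )
            ++lines;
    std::cout << lines << " lines\n";

    munmap( p, sb.st_size );
    close( fd );
    return 0;
}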

SigTerm