I normally use the method described in the csv parser answer to read spreadsheet files. However, when reading a 64 MB file with around 40 columns and 250K rows of data, it takes about 4 minutes. In the original method, a CSVRow class reads the file row by row, and a private vector stores all the data in a row.
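For reference, my reader is essentially the CSVRow class from that answer, adapted for tabs (simplified sketch here, not my exact code):

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

class CSVRow {
public:
    const std::string& operator[](std::size_t index) const { return m_data[index]; }
    std::size_t size() const { return m_data.size(); }

    // Read one line and split it on tabs into the private vector.
    void readNextRow(std::istream& str) {
        std::string line;
        std::getline(str, line);

        std::stringstream lineStream(line);
        std::string cell;

        m_data.clear();
        while (std::getline(lineStream, cell, '\t'))
            m_data.push_back(cell);
    }

private:
    std::vector<std::string> m_data;  // one entry per column
};

std::istream& operator>>(std::istream& str, CSVRow& data) {
    data.readNextRow(str);
    return str;
}
```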
Several things to note:
- I did reserve enough capacity for the vector, but that didn't help much.
- I also need to create instances of some classes while reading each line, but even when the code just reads in the data without creating any instances, it still takes a long time.
- The file is tab-delimited rather than comma-delimited, but I don't think that matters.
Since some columns in that file do not contain useful data, I changed the method to store the whole line in a private string member and then find the positions of the (n-1)th and nth delimiters to extract the useful fields (of course, there are many useful columns). Doing so avoids some push_back operations and cut the time to a little more than 2 minutes, but that still seems too long to me.
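Concretely, the extraction now looks something like this (a sketch; `fieldAt` is just my name for the helper):

```cpp
#include <cstddef>
#include <string>

// Return the n-th (0-based) tab-delimited field of `line` without
// tokenizing the whole row: find the (n-1)th and nth delimiters and
// take the substring between them.
std::string fieldAt(const std::string& line, std::size_t n) {
    std::size_t begin = 0;
    for (std::size_t i = 0; i < n; ++i) {
        begin = line.find('\t', begin);
        if (begin == std::string::npos)
            return std::string();  // fewer than n fields
        ++begin;                   // skip past the delimiter itself
    }
    std::size_t end = line.find('\t', begin);
    return line.substr(begin, end == std::string::npos
                                  ? std::string::npos
                                  : end - begin);
}
```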
Here are my questions:
- Is there a way to read such a spreadsheet file more efficiently?
- Should I read the file by buffer instead of line by line? If so, how do I read by buffer and still use the CSVRow class? (See the sketch after this list for what I mean.)
- I haven't tried Boost.Tokenizer; is it more efficient?
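By "reading by buffer" I mean something like slurping the whole file into memory at once and scanning it there instead of calling getline() per row, roughly like this (a sketch; `data.tsv` is a hypothetical file name, and I don't know how to plug the CSVRow class into it):

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read the entire file into one string in a single shot.
std::string slurp(const char* path) {
    std::ifstream file(path, std::ios::binary);
    return std::string((std::istreambuf_iterator<char>(file)),
                       std::istreambuf_iterator<char>());
}

int main() {
    std::string buffer = slurp("data.tsv");  // hypothetical file name

    // Walk the buffer line by line with find() instead of getline().
    std::size_t pos = 0;
    while (pos < buffer.size()) {
        std::size_t eol = buffer.find('\n', pos);
        if (eol == std::string::npos)
            eol = buffer.size();
        // ... process the line in buffer[pos, eol) here ...
        pos = eol + 1;
    }
    return 0;
}
```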
Thank you for your help!