In order to improve performance when reading from a file, I'm trying to read the entire content of a big (several MB) file into memory and then use an istringstream to access the information.

My question is: what is the best way to read this information and "import it" into the string stream? A problem with this approach (see below) is that when the string stream is created the buffer gets copied, and memory usage doubles.

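#include <fstream>
#include <sstream>
#include <string>

using namespace std;

int main() {
  // hypothetical file name; the original snippet left sFilename undefined
  const string sFilename = "data.bin";

  ifstream is;
  is.open (sFilename.c_str(), ios::binary );

  // get length of file:
  is.seekg (0, std::ios::end);
  long length = is.tellg();
  is.seekg (0, std::ios::beg);

  // allocate memory:
  char *buffer = new char [length];

  // read data as a block:
  is.read (buffer,length);

  // create string stream of memory contents
  // NOTE: this ends up copying the buffer!!!
  // The length is passed explicitly because the buffer is not
  // null-terminated, and a named string avoids declaring a function
  // by accident ("most vexing parse").
  string contents( buffer, length );
  istringstream iss( contents );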

  // delete temporary buffer
  delete [] buffer;

  // close filestream
  is.close();

  /* ==================================
   * Use iss to access data
   */

}
+2  A: 

Maybe you should look into memory-mapped files instead.

David Pierre
+7  A: 

std::ifstream has a method rdbuf(), that returns a pointer to a filebuf. You can then "push" this filebuf into your stringstream:

#include <fstream>
#include <sstream>

int main()
{
    std::ifstream file( "myFile" );

    if ( file )
    {
        std::stringstream buffer;

        buffer << file.rdbuf();

        file.close();

        // operations on the buffer...
    }
}

EDIT: As Martin York remarks in the comments, this might not be the fastest solution, since the stringstream operator<< will read the filebuf character by character. You might want to check his answer, where he uses the ifstream read method as you do, and then sets the stringstream buffer to point to the previously allocated memory.

Luc Touraille
Hi Luc, I agree with your suggestion... manipulating the rdbuf is the way to go! But doesn't your solution have the same problem? Don't you create 2 copies of the same buffer, at least momentarily?
Marcos Bento
Because by the time operator<<() sees the result of rdbuf() it is just a stream buffer, with no concept of a file buffer at this point, it cannot look up its length and thus must use a loop to read 1 char at a time. Also, the stringstream's internal buffer (std::string) must be resized as data is inserted.
Martin York
A: 

This seems like premature optimization to me. How much work is being done in the processing? Assuming a modernish desktop/server, and not an embedded system, copying a few MB of data during initialization is fairly cheap, especially compared to reading the file off disk in the first place. I would stick with what you have, measure the system when it is complete, and then decide if the potential performance gains would be worth it. Of course, if memory is tight, or this is in an inner loop or a program that gets called often (like once a second), that changes the balance.
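As a rough way to put a number on "fairly cheap", here is a minimal sketch that times a single copy of an in-memory buffer into an istringstream. The function name time_copy_ms is hypothetical, the buffer is filled with a dummy character rather than real file data, and a single run like this is only indicative, not a proper benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>

// Time (in milliseconds) one copy of `bytes` bytes into an istringstream.
// The buffer holds a dummy character; no file is involved.
double time_copy_ms(std::size_t bytes)
{
    const std::string source(bytes, 'x');

    const auto start = std::chrono::steady_clock::now();
    std::istringstream iss(source);   // copies the string's contents
    const auto stop = std::chrono::steady_clock::now();

    // Touch the stream so the copy is actually used.
    if (iss.peek() != 'x') return -1.0;

    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling it with something like time_copy_ms(4 * 1024 * 1024) and comparing the result with the time taken to read the same amount from disk gives a first impression of whether the copy matters at all.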

KeithB
A: 

Another thing to keep in mind is that file I/O is always going to be the slowest operation. Luc Touraille's solution is correct, but there are other options. Reading the entire file into memory at once will be much faster than separate reads.

luke
+13  A: 

OK. I am not saying this will be quicker than reading from the file.

But this is a method where you create the buffer once and after the data is read into the buffer use it directly as the source for stringstream.

N.B. It is worth mentioning that std::ifstream is buffered. It reads data from the file in (relatively large) chunks. Stream operations are performed against the buffer, only returning to the file for another read when more data is needed. So before sucking all the data into memory, please verify that this is a bottleneck.

#include <fstream>
#include <sstream>
#include <vector>

int main()
{
    std::ifstream       file("Plop");
    if (file)
    {
        /*
         * Get the size of the file
         */
        file.seekg(0,std::ios::end);
        std::streampos          length = file.tellg();
        file.seekg(0,std::ios::beg);

        /*
         * Use a vector as the buffer.
         * It is exception safe and will be tidied up correctly.
         * This constructor creates a buffer of the correct length.
         *
         * Then read the whole file into the buffer.
         */
        std::vector<char>       buffer(length);
        file.read(&buffer[0],length);

        /*
         * Create your string stream.
         * Get the stringbuffer from the stream and set the vector as its source.
         */
        std::stringstream       localStream;
        localStream.rdbuf()->pubsetbuf(&buffer[0],length);

        /*
         * Note the buffer is NOT copied, if it goes out of scope
         * the stream will be reading from released memory.
         */
    }
}
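One caveat: the effect of pubsetbuf (setbuf) on a std::stringbuf is implementation-defined, so some standard libraries honor the call above while others may ignore it. A commonly used alternative is a small custom stream buffer that points its get area at the existing memory. The sketch below is not from the answer; the names MemoryBuf and first_token are hypothetical, and it supports sequential reading only.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical helper: a read-only stream buffer that exposes an
// existing block of memory without copying it.
class MemoryBuf : public std::streambuf
{
public:
    MemoryBuf(char* data, std::size_t size)
    {
        // setg(begin, current, end) defines the area the stream reads from.
        setg(data, data, data + size);
    }
};

// Reads the first whitespace-delimited token directly out of `storage`.
// `storage` must outlive the stream, just like the vector in the answer.
std::string first_token(std::vector<char>& storage)
{
    if (storage.empty()) return std::string();

    MemoryBuf buf(storage.data(), storage.size());
    std::istream in(&buf);

    std::string token;
    in >> token;
    return token;
}
```

Because this minimal buffer does not override seekoff or seekpos, seeking with seekg will fail on such a stream; those overrides would have to be added if random access is needed.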
Martin York
@Martin York, how do you learn these details? Do you read up on them, or do you research them when you run into a problem? Thanks so much, btw.
Gollum
@Gollum: No, these are just details gained from two areas: 1) Using the stream classes all the time. 2) Having implemented my own stream classes. Number (2) makes you do a lot of reading about how the stream is supposed to work, because you want your stream to work the same way as the standard streams (so that you can reuse the STL library functions written for standard streams). The only non-intuitive bit of the above is modifying how the stream buffer works.
Martin York
Can you suggest a book or some resources? I want to understand the Standard Template Library in depth (not just how to use it, but how it actually works inside).
Gollum
I don't think the bit about "Because char is a POD data type it is not initialized." is correct. The constructor actually has two arguments, the second being which value to initialize the elements with. It defaults to `T()` or `char()` in our case, meaning 0. So all the elements should be 0.
GMan