views: 1200
answers: 10

I have to deal with very large text files (around 2 GB), and I need to read/write them line by line. Writing 23 million lines with ofstream is really slow, so at first I tried to speed things up by accumulating large chunks of lines in a memory buffer (for example 256 MB or 512 MB) and then writing the whole buffer to the file at once. This did not help; the performance is more or less the same. I have the same problem reading the files. I know the I/O operations are already buffered by the STL I/O system, and that performance also depends on the disk scheduler policy (managed by the OS, in my case Linux).
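Roughly, this is the chunked-write idea I tried (a simplified sketch; the file name and line contents are just placeholders for the real data):

    #include <cstddef>
    #include <fstream>
    #include <string>

    int main()
    {
        const std::size_t chunk_size = 256 * 1024 * 1024;   // flush roughly every 256 MB
        std::string buffer;
        buffer.reserve(chunk_size);

        std::ofstream out("output.txt", std::ios::binary);

        for (long i = 0; i < 23000000; ++i)
        {
            buffer += "some line of text\n";              // stands in for the real formatted line
            if (buffer.size() >= chunk_size)              // write the whole chunk in one call
            {
                out.write(buffer.data(), buffer.size());
                buffer.clear();
            }
        }
        out.write(buffer.data(), buffer.size());          // write whatever is left over
    }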

Any idea about how to improve the performance?

PS: I have been thinking about using a background child process (or a thread) to read/write the data chunks while the program is processing data but I do not know (mainly in the case of the subprocess) if this will be worthy.

+5  A: 

Maybe you should look into memory mapped files.

Check them out in this library: Boost.Interprocess
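As a rough sketch (assuming Boost is available; the file name and the line-counting loop are only illustrative), a read-only mapping with Boost.Interprocess looks like this:

    #include <boost/interprocess/file_mapping.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstddef>
    #include <iostream>

    int main()
    {
        namespace bip = boost::interprocess;

        // Map the whole file read-only; the OS pages it in on demand.
        bip::file_mapping  mapping("big_file.txt", bip::read_only);
        bip::mapped_region region(mapping, bip::read_only);

        const char* data = static_cast<const char*>(region.get_address());
        std::size_t size = region.get_size();

        // Walk the mapped bytes directly, with no explicit read() calls.
        std::size_t lines = 0;
        for (std::size_t i = 0; i < size; ++i)
            if (data[i] == '\n')
                ++lines;

        std::cout << lines << " lines\n";
    }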

David Pierre
MMFs would have been my suggestion as well. +1 for mentioning Boost's support for it.
OregonGhost
+6  A: 

I would also suggest memory-mapped files, but if you're going to use Boost, I think boost::iostreams::mapped_file is a better match than boost::interprocess.
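For example (a sketch only; the file name is a placeholder), mapped_file_source hands you the file as a plain char range:

    #include <boost/iostreams/device/mapped_file.hpp>
    #include <algorithm>
    #include <iostream>

    int main()
    {
        // Maps the file read-only and exposes it as contiguous memory.
        boost::iostreams::mapped_file_source file("big_file.txt");

        const char* begin = file.data();
        const char* end   = begin + file.size();

        // Process the mapped bytes directly, e.g. count the lines.
        std::cout << std::count(begin, end, '\n') << " lines\n";
    }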

Andreas Magnusson
I wasn't aware of that one.
David Pierre
A: 

Thanks for the advice, but will it be faster than using buffered chunks of X MB? For example:

unsigned int _buffer_size = 64 * 1024 * 1024; // 64 MB for instance.
char* _data_buffer = new char[_buffer_size];  
_file->read(_data_buffer, _buffer_size);
// Read directly from the memory using _data_buffer

I have tried this approach (even loading the complete file) and it is not faster than reading line by line with an STL ifstream :(

Jacob
It's not easy to say, but using a memory-mapped file allows the OS to manage it in the most efficient way. Buffering the whole file yourself will likely degrade performance, as it would probably cause a lot of swapping.
Andreas Magnusson
Is this the only way you can organize the data? Maybe there's another way to deal with the whole problem? Using a DB? Split the file into a hierarchy of files?
Andreas Magnusson
Don't allocate the buffer like that. Use std::vector. See my answer below.
Martin York
+5  A: 

A 2GB file is pretty big, and you need to be aware of all the possible areas that can act as bottlenecks:

  • The HDD itself
  • The HDD interface (IDE/SATA/RAID/USB?)
  • Operating system/filesystem
  • C/C++ Library
  • Your code

I'd start by doing some measurements:

  • How long does your code take to read/write a 2GB file?
  • How long does the 'cp' command take to copy it?
  • How long does it take to write/read using just big fwrite()/fread() calls? (A rough timing sketch follows this list.)
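
Something along these lines would do for the third measurement (a sketch; the file name and chunk size are arbitrary):

    #include <chrono>
    #include <cstdio>
    #include <iostream>
    #include <vector>

    int main()
    {
        const std::size_t chunk = 64 * 1024 * 1024;       // read 64 MB per fread() call
        std::vector<char> buffer(chunk);

        std::FILE* f = std::fopen("big_file.txt", "rb");
        if (!f) return 1;

        auto start = std::chrono::steady_clock::now();
        std::size_t total = 0, n = 0;
        while ((n = std::fread(&buffer[0], 1, chunk, f)) > 0)
            total += n;
        auto stop = std::chrono::steady_clock::now();
        std::fclose(f);

        double secs = std::chrono::duration<double>(stop - start).count();
        double mb   = total / (1024.0 * 1024.0);
        std::cout << mb << " MB in " << secs << " s = " << mb / secs << " MB/s\n";
    }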

Assuming your disk is capable of reading/writing at about 40 MB/s (probably a realistic figure to start from), reading or writing your 2GB file can't take much less than about 50 seconds.

How long is it actually taking?

Hi Roddy, using the fstream read method with 1.1 GB files and large buffers (128, 255 or 512 MB), it takes about 43-48 seconds, and it is the same using fstream getline (line by line). cp takes almost 2 minutes to copy the file.

In which case, you're hardware-bound. cp has to read and write, and will be seeking back and forth across the disk surface like mad while it does it. So it will (as you see) be more than twice as slow as the simple 'read' case.

To improve the speed, the first thing I'd try is a faster hard drive. Maybe a WD Velociraptor?

You haven't said what the disk interface is. SATA is pretty much the easiest/fastest option. Also (obvious point, this...) make sure the disk is physically on the same machine your code is running on, otherwise you're network-bound...

Roddy
If you're hitting hardware limitations, moving to a marginally faster drive won't help as much as moving to striped drives. Also, why use cp for this? Instead, use dd if=/dev/zero of=/path and just test the write throughput. Experiment with block sizes (bs=4K, bs=32K) to see how that affects speed.
Mitch Haile
A: 

If you are going to buffer the file yourself, then I'd advise some testing using unbuffered I/O (setvbuf on a file that you've fopened can turn off the library buffering).

Basically, if you are going to buffer yourself, you want to disable the library's buffering, as it will only cause you pain. I don't know whether there is a way to do that for STL I/O, so I recommend dropping down to C-level I/O.
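Something like this (a sketch; the file name and chunk size are placeholders) turns the library buffering off and leaves the chunking entirely to you:

    #include <cstdio>

    int main()
    {
        std::FILE* f = std::fopen("big_file.txt", "rb");
        if (!f) return 1;

        // Turn off the C library's own buffering; we do our own chunked reads.
        std::setvbuf(f, NULL, _IONBF, 0);

        static char chunk[64 * 1024 * 1024];   // our own 64 MB buffer
        std::size_t n;
        while ((n = std::fread(chunk, 1, sizeof chunk, f)) > 0)
        {
            // process n bytes of chunk here
        }
        std::fclose(f);
    }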

Michael Kohne
A: 

Hi Roddy, using the fstream read method with 1.1 GB files and large buffers (128, 255 or 512 MB), it takes about 43-48 seconds, and it is the same using fstream getline (line by line). cp takes almost 2 minutes to copy the file.

Michael, regarding setvbuf, I get the same results.

I think you are right, Roddy; I can't improve the performance because of the hardware limitation.

Jacob
It's better to either add this to your original post, or as a comment on my reply.
Roddy
+2  A: 

Just a thought, but avoid using std::endl as this will force a flush before the buffer is full. Use '\n' instead for a newline.
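For instance (a trivial sketch):

    #include <fstream>

    int main()
    {
        std::ofstream out("out.txt");
        for (long i = 0; i < 23000000; ++i)
        {
            // out << i << std::endl;  // inserts '\n' AND flushes the stream every line: slow
            out << i << '\n';          // just inserts the newline; flushing happens when the buffer fills
        }
    }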

Evan Teran
Yes, you are right. Good point :)
Jacob
+2  A: 

Don't use new to allocate the buffer like that:

Try std::vector<char> instead:

unsigned int      buffer_size = 64 * 1024 * 1024; // 64 MB for instance.
std::vector<char> data_buffer(buffer_size);
_file->read(&data_buffer[0], buffer_size);

Also read up on the rules for using underscores in identifier names (names beginning with an underscore followed by an uppercase letter, or containing a double underscore, are reserved for the implementation). Your code happens to be OK, but it's an easy convention to get wrong.

Martin York
I used new and char* just to make it as fast as possible. This code was in a class method; my personal style is to use the underscore prefix to identify class member variables, while the method's local variables have no prefix.
Jacob
+1  A: 

Using getline() may be inefficient because the string buffer may need to be resized several times as data is appended to it from the stream buffer. You can make this more efficient by pre-sizing the string.

You can also set the iostream's buffer either to be very large or to NULL (for unbuffered access):

// Option 1: Unbuffered access - install a null buffer before opening.
std::fstream file;
file.rdbuf()->pubsetbuf(NULL, 0);
file.open("PLOP");

// Option 2: A larger buffer - hand the stream your own big buffer before opening.
std::vector<char> buffer(64 * 1024 * 1024);
std::fstream      file;
file.rdbuf()->pubsetbuf(&buffer[0], buffer.size());
file.open("PLOP");

// Either way, pre-size the string so getline() does not keep reallocating it.
std::string line;
line.reserve(64 * 1024 * 1024);

while (std::getline(file, line))
{
    // Do Stuff.
}
Martin York
The class uses a char* buffer that is associated with the streambuf of an istringstream. I load the raw data from the file directly into the buffer and use the stringstream to format it later, but that did not improve the performance. Just in case, I tried ifstream with pubsetbuf, but it is slower. Why?
Jacob
A: 

Linux: How to Use RAM as Swap

HTH

plan9assembler