
Hello,

Basic C++ question here. I'm trying to read a large file on Windows 7 Pro. The compiler is Visual Studio 2010 (cl version 16.0). I'm finding that the program runs about 5 times slower on Windows 7 than on a virtual machine running Ubuntu 10.04 (gcc 4.4.3) on the same box. The file is rather large: ~900 MB and about 17 million lines. Reading it takes about 13 seconds on Windows 7 and about 2.3 seconds on the Ubuntu VM. I'm compiling with /O2 on Visual C++ and -O3 on gcc. The code in question can be narrowed down to the snippet below. Any clues on Windows-specific tuning to read files faster?

Thanks

#include <iostream>
#include <string>
#include <fstream>

using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 2) {
        cout << "Usage: " << argv[0] << " <file>" << endl;
        return 1;
    }
    const char* test_file_path = argv[1];
    ifstream ifs(test_file_path);

    if (!ifs.is_open()) {
        cout << "Could not open " << test_file_path << endl;
        return 1;
    }

    unsigned long line_count = 0;
    string line;
    // Go through all the lines in the file
    while (getline(ifs, line)) {
        line_count++;
    }
    cout << line_count << '\n';
    return 0;
}

Edit: Tried the boost memory-mapped file suggestion by Anders and the time dropped to 1.2 seconds. It looks like Ubuntu effectively gives you this behavior by default, while on Windows you need to ask for it explicitly. Thanks Anders.

+1  A: 

Usually, if you're concerned about performance, it's much better to do your own caching rather than rely on the system's. Depending on the implementation, getline() can be very slow. For instance, a typical fread() implementation uses a small internal buffer (often around 4 KB), so a large file means many I/O calls. So implement some buffering of your own and write your own getline() on top of it.

But in this particular case, I think the problem is somewhere else. When reading from a standard HDD you'll get about 80-100 MB/s, which gives about 10 seconds in the best case for a file this size. The processing logic here is very simple, so the HDD should be the bottleneck. There are two possible reasons for the difference:

1) Caching of the file by Windows (the first test on Windows put it in the cache, so the subsequent read of the same file from the VM came from the system cache).

2) If the file is in the VM's file system and the VM uses compression for its virtual disk (this being a text file, it will be highly compressible), then the actual amount of HDD I/O will be much lower.

ruslik
The second option is correct. I think there is some compression by the VM going on. The reported disk read speed is 70 MB/s, so it's hard to see how it could otherwise finish reading the 900 MB file in less than 3 seconds. Surprisingly, wc -l runs in less than a second.
+2  A: 

If you are concerned with performance, I would recommend stepping out of the C++ iostream world and into Win32 API file handling, e.g. memory-mapped files (Boost has a library for that).

Anders K.
Thanks Anders. However, I'd prefer to stick with standard C++ and was looking for any quick fixes/flags to match the Ubuntu VM performance. It looks like the VM adds some compression plus some smart file-system caching in Ubuntu.
Well, if you use Boost it's still standard C++ and the OS details are hidden (AFAIK).
Anders K.
That was a great suggestion, Anders. I tried the boost memory-mapped file approach and saw the time drop to about 1 second, and that included a custom getline over the mapped file. Thanks.