I need to read (scan) a file sequentially and process its content. File size can be anything from very small (some KB) to very large (some GB).

I tried two techniques using VC10/VS2010 on Windows 7 64-bit:

  1. Win32 memory mapped files (i.e. CreateFile, CreateFileMapping, MapViewOfFile, etc.)
  2. fopen and fread from CRT.

I thought that memory mapped file technique could be faster than CRT functions, but some tests showed that the speed is almost the same in both cases.

The following C++ statements are used for MMF:

HANDLE hFile = CreateFile(
    filename,
    GENERIC_READ,
    FILE_SHARE_READ,
    NULL,
    OPEN_EXISTING,
    FILE_FLAG_SEQUENTIAL_SCAN,
    NULL
    );

HANDLE hFileMapping = CreateFileMapping(
    hFile,
    NULL,
    PAGE_READONLY,
    0,
    0,
    NULL
    );

The file is read sequentially, chunk by chunk; each chunk is SYSTEM_INFO.dwAllocationGranularity in size.
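For illustration, the chunk-by-chunk loop described above might look roughly like the following sketch. It is not the exact test code: `viewSize` and `scanMappedFile` are hypothetical names, and summing the bytes stands in for the real processing. The Win32 part is guarded so the small helper also compiles on other platforms.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#endif

// Size of the view to map at `offset`: one full chunk, or whatever is
// left at the end of the file.
inline std::uint64_t viewSize(std::uint64_t fileSize, std::uint64_t offset,
                              std::uint64_t chunk)
{
    assert(chunk > 0 && offset <= fileSize);
    const std::uint64_t remaining = fileSize - offset;
    return remaining < chunk ? remaining : chunk;
}

#ifdef _WIN32
// Hypothetical helper: walks the file one view at a time and sums its bytes.
// Returns false if any Win32 call fails.
inline bool scanMappedFile(HANDLE hFile, HANDLE hFileMapping, std::uint64_t& sum)
{
    LARGE_INTEGER size;
    if (!GetFileSizeEx(hFile, &size)) return false;

    SYSTEM_INFO si;
    GetSystemInfo(&si);
    const std::uint64_t chunk = si.dwAllocationGranularity;
    const std::uint64_t fileSize = static_cast<std::uint64_t>(size.QuadPart);

    sum = 0;
    for (std::uint64_t offset = 0; offset < fileSize; offset += chunk) {
        const std::uint64_t len = viewSize(fileSize, offset, chunk);
        const void* view = MapViewOfFile(
            hFileMapping, FILE_MAP_READ,
            static_cast<DWORD>(offset >> 32),          // high 32 bits of offset
            static_cast<DWORD>(offset & 0xFFFFFFFFu),  // low 32 bits of offset
            static_cast<SIZE_T>(len));
        if (!view) return false;

        const unsigned char* p = static_cast<const unsigned char*>(view);
        for (std::uint64_t i = 0; i < len; ++i) sum += p[i];  // "process" the chunk

        UnmapViewOfFile(view);
    }
    return true;
}
#endif
```

View offsets must be multiples of dwAllocationGranularity, which holds here because the loop advances by exactly that amount.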

Considering that the speed is almost the same with MMF and CRT, I'd use the CRT functions because they are simpler and multi-platform. But I'm curious: am I using the MMF technique correctly? Is it normal that MMF performance, when scanning a file sequentially, is the same as with CRT?

Thanks.

+3  A: 

I believe you won't see much difference if you access the file sequentially, because file I/O is very heavily cached and read-ahead is probably used as well.

Things would be different if you had many "jumps" while processing the file data. Then setting a new file pointer and reading a new portion of the file each time would probably hurt CRT performance badly, whereas MMF would give you the best possible performance.
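To make the contrast concrete, here is a minimal sketch of what one "jump" costs on each path. `readAtCrt` and `readAtMapped` are hypothetical helpers, and `base` is assumed to be the start of a view obtained from MapViewOfFile.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// CRT path: every jump is a seek followed by a read that copies into `dst`.
// `long` offsets limit this to files under 2 GB on Windows; _fseeki64 lifts
// that limit there.
inline bool readAtCrt(std::FILE* f, long offset, void* dst, std::size_t n)
{
    assert(f != NULL && dst != NULL);
    if (std::fseek(f, offset, SEEK_SET) != 0) return false;
    return std::fread(dst, 1, n, f) == n;
}

// Mapped path: a jump is plain pointer arithmetic. Nothing is copied; the
// page is faulted in the first time it is touched.
inline const unsigned char* readAtMapped(const unsigned char* base,
                                         std::size_t offset)
{
    assert(base != NULL);
    return base + offset;
}
```

Whether the difference is measurable depends on how many jumps there are and how small the reads are, so it is worth timing with your own access pattern.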

valdo
+2  A: 

Since you are scanning the file sequentially, I would not expect the disk usage pattern to be much different for either method.

For large files, MMF might reduce data locality and even result in a copy of all or part of the file being placed in the pagefile, whereas processing via CRT using a small buffer would all take place in RAM. In this instance, MMF would probably be slower. You can mitigate this by only mapping in part of the underlying file at a time, but then things get more complex without any likely win over direct sequential I/O.
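As a rough sketch of the CRT approach with a small buffer, something like the following would do. `scanWithCrt` is a hypothetical name, and counting bytes stands in for the real processing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: reads `path` sequentially through a fixed-size buffer
// and returns the number of bytes seen, or -1 on error.
inline long long scanWithCrt(const char* path, std::size_t bufferSize)
{
    assert(bufferSize > 0);
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return -1;

    std::vector<unsigned char> buffer(bufferSize);
    long long total = 0;
    std::size_t n;
    while ((n = std::fread(&buffer[0], 1, buffer.size(), f)) > 0) {
        // Process buffer[0 .. n) here; only this one buffer is ever in use.
        total += static_cast<long long>(n);
    }

    const bool failed = std::ferror(f) != 0;  // distinguish error from EOF
    std::fclose(f);
    return failed ? -1 : total;
}
```

A call such as `scanWithCrt("data.bin", 64 * 1024)` keeps the working set at one 64 KB buffer regardless of file size; the file name and buffer size here are only examples.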

MMF are really the way Windows implements inter-process shared memory, rather than a way to speed up generalized file I/O. The file manager cache in the kernel is what you really need to leverage here.

Steve Townsend
A: 

Both methods eventually come down to disk I/O, which will be your bottleneck. I would go with whichever method suits my higher-level functionality better: if I need streaming, I'll go with files; if I need sequential access and fixed-size files, I would consider memory mapped files.

Or, if you have an algorithm that works only on memory, memory mapped files can be an easier way out.

Daniel Mošmondor
+1  A: 

"I thought that memory mapped file technique could be faster than CRT functions, but some tests showed that the speed is almost the same in both cases."

You are probably hitting the file system cache for your tests. Unless you explicitly create file handles to bypass the file system cache (FILE_FLAG_NO_BUFFERING when calling CreateFile), the file system cache will kick in and keep recently accessed files in memory.
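If you want to time reads that bypass the cache, a sketch along these lines could work. `roundUp` and `scanUncached` are hypothetical helpers, `sectorSize` is assumed to come from GetDiskFreeSpace, and the 1 MB read size is an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `n` up to the next multiple of `alignment` (e.g. the sector size).
inline std::size_t roundUp(std::size_t n, std::size_t alignment)
{
    assert(alignment > 0);
    return (n + alignment - 1) / alignment * alignment;
}

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: counts the bytes in `path` while bypassing the file
// system cache. Returns -1 on failure.
inline long long scanUncached(const wchar_t* path, DWORD sectorSize)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                           NULL);
    if (h == INVALID_HANDLE_VALUE) return -1;

    // Unbuffered reads need a sector-aligned buffer and a read size that is a
    // multiple of the sector size. VirtualAlloc returns page-aligned memory,
    // which covers common sector sizes.
    const DWORD size = static_cast<DWORD>(roundUp(1 << 20, sectorSize));
    void* buf = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    if (!buf) { CloseHandle(h); return -1; }

    long long total = 0;
    DWORD got = 0;
    BOOL ok;
    while ((ok = ReadFile(h, buf, size, &got, NULL)) != FALSE && got > 0)
        total += got;  // process the `got` bytes in `buf` here

    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return ok ? total : -1;
}
#endif
```

Comparing this against the buffered version on the same file should show how much of the measured speed comes from the cache rather than from the disk.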

Even when the file is already in the file system cache, buffered reads are slightly slower than reading through a mapped view, because the operating system has to perform an extra copy into your buffer and each read incurs system call overhead. But for your purposes, you should probably stick with the CRT file functions.

Gustavo Duarte has a great article on memory mapped files (from a generic OS perspective).

MSN