views: 127 | answers: 3

I've got a very strange problem on my Windows XP in VirtualBox.

The ReadFile() function refuses to read more than 16 MB of data in a single call. It returns error code 87 (ERROR_INVALID_PARAMETER). It looks like the data length is limited to 24 bits.

Here is the example code that allowed me to find the exact limit.

#include <conio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <io.h>
#include <sys/stat.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    int fd,len,readed;
    char *buffer;
    char *fname="Z:\\test.dat";
    fd=_open(fname,_O_RDWR|_O_BINARY,_S_IREAD|_S_IWRITE);
    if (fd==-1) {
        printf("Error opening file : %s\n",strerror(errno));
        getch();
        return -1;
    }
    /* Determine the file size by seeking to the end. */
    len=_lseek(fd,0,SEEK_END);
    _lseek(fd,0,SEEK_SET);
    if (!len) {
        printf("File length is 0.\n");
        getch();
        return -2;
    }
    buffer=(char *)malloc(len);
    if (!buffer) {
        printf("Failed to allocate memory.\n");
        getch();
        return -3;
    }
    /* Probe for the largest block size _read() accepts: shrink the
       requested length by 100 bytes per iteration until a read succeeds
       (readed==len) or the request becomes trivially small. */
    readed=0;
    while (readed<len) {
        len-=100;
        readed=_read(fd,buffer,len);
        if (len<=100) break;
    }
    if (readed!=len) {
        printf("Failed to read file: result %d error %s\n",readed,strerror(errno));
        getch();
        return -4;
    }
    _close(fd);
    printf("Success (%u).",len);
    getch();
    return 0;
}

The file Z:\test.dat is 21 MB long.

The result is "Success (16777200)."
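
To rule out the CRT, here is a minimal sketch that calls ReadFile() directly (same Z:\test.dat path assumed, error handling kept to a minimum):

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    HANDLE hFile;
    DWORD size, bytesRead = 0;
    char *buffer;

    hFile = CreateFileA("Z:\\test.dat", GENERIC_READ, FILE_SHARE_READ,
                        NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: error %lu\n", GetLastError());
        return 1;
    }

    size = GetFileSize(hFile, NULL);   /* ~21 MB, fits in a DWORD */
    buffer = (char *)malloc(size);

    /* Ask for the whole file in one call - on the affected system this is
       the call that fails with error 87 (ERROR_INVALID_PARAMETER) once the
       requested size exceeds 16 MB. */
    if (!ReadFile(hFile, buffer, size, &bytesRead, NULL))
        printf("ReadFile failed: error %lu\n", GetLastError());
    else
        printf("ReadFile succeeded: %lu bytes\n", bytesRead);

    free(buffer);
    CloseHandle(hFile);
    return 0;
}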

I tried to find similar issues on Google without any success :(

Maybe someone knows what the cause of the problem is?

+4  A: 

It is entirely legal for a device driver to return fewer bytes than requested. That's why ReadFile() has the lpNumberOfBytesRead argument. You should avoid low-level CRT implementation details like _read(); use fread() instead.

Update: this isn't the correct answer. It looks like your virtual machine simply refuses to honor ReadFile() calls that ask for more than 16 MB. It probably has something to do with an internal buffer it uses to talk to the host operating system. There is nothing you can do but call fread() in a loop, as sketched below, so that you stay under this upper limit.
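
A minimal sketch of such a loop, reading in chunks that stay safely under the observed 16 MB limit (the helper name and the 8 MB chunk size are illustrative, not part of the original answer):

#include <stdio.h>

/* Read up to 'len' bytes into 'buffer' using chunks that stay below the
   observed 16 MB limit. Returns the total number of bytes actually read. */
static size_t read_in_chunks(FILE *f, char *buffer, size_t len)
{
    const size_t chunk = 8u * 1024u * 1024u;   /* 8 MB, safely below 16 MB */
    size_t total = 0;

    while (total < len) {
        size_t want = len - total;
        size_t got;

        if (want > chunk)
            want = chunk;

        got = fread(buffer + total, 1, want, f);
        if (got == 0)          /* EOF or error - caller can check ferror(f) */
            break;
        total += got;
    }
    return total;
}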

Hans Passant
ReadFile() just fails. It does not read anything if the supplied nNumberOfBytesToRead is greater than 16 MB; the error code is ERROR_INVALID_PARAMETER (87).
mephisto123
Okay, it's allowed to bitch about that as well. Use fread().
Hans Passant
I just replaced _read() and friends with the f*() equivalents. Same issue.
mephisto123
I even tried fread(buffer, len/2, 2, f) - it still fails for len > 16 MB.
mephisto123
What is the runtime environment like? Who created the CRT? You will probably have to punt on this and call fread() for partial reads in a loop.
Hans Passant
I've traced the fread() and _read() calls down to the ReadFile() call, so the problem is definitely not in the CRT. It is the WinAPI function that returns this error; fread() and _read() are just wrappers around that ReadFile() call. Yes, I'm going to apply this little 'hack' and split the file into smaller blocks, but I'm still curious about the cause of the problem.
mephisto123
The cause of the problem has nothing to do with ReadFile() itself. The looping code is buggy to begin with. See my other answer.
Remy Lebeau - TeamB
I commented on your answer - you didn't read my question :(
mephisto123
+2  A: 

I would recommend using memory-mapped files (see also http://msdn.microsoft.com/en-us/library/aa366556.aspx). The following simple code shows one way to do this:

LPCTSTR pszSrcFilename = TEXT("Z:\\test.dat");
HANDLE hSrcFile = CreateFile (pszSrcFilename, GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING,
                              FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                              NULL);
HANDLE hMapSrcFile = CreateFileMapping (hSrcFile, NULL, PAGE_READONLY, 0, 0, NULL);
PBYTE pSrcFile = (PBYTE) MapViewOfFile (hMapSrcFile, FILE_MAP_READ, 0, 0, 0);
DWORD dwInFileSizeHigh, dwInFileSizeLow;
dwInFileSizeLow = GetFileSize (hSrcFile, &dwInFileSizeHigh);

After these few simple steps you have a pointer, pSrcFile, which represents the whole file contents. Is this not what you need? The total size of the memory block is stored in dwInFileSizeHigh and dwInFileSizeLow: ((__int64)dwInFileSizeHigh << 32) + dwInFileSizeLow.

This uses the same feature of the Windows kernel that is used to implement the swap file (page file). It is buffered by the disk cache and very efficient. If you plan to access the file mostly sequentially, including the FILE_FLAG_SEQUENTIAL_SCAN flag in the call to CreateFile() hints this fact to the system, causing it to read ahead for even better performance.

I see that the file you read in your test example is named "Z:\test.dat". If it is a file coming from a network drive, you will see a clear performance advantage. Moreover, according to http://msdn.microsoft.com/en-us/library/aa366542.aspx, the limit is about 2 GB instead of 16 MB. I recommend mapping files up to about 1 GB at a time and then simply creating a new view with MapViewOfFile (a sketch of such a windowed view appears after the quoted passage below; I am not sure your code needs to work with files that large). Beyond that, on the same MSDN page you can read the following:

The size of the file mapping object that you select controls how far into the file you can "see" with memory mapping. If you create a file mapping object that is 500 Kb in size, you have access only to the first 500 Kb of the file, regardless of the size of the file. Since it does not cost you any system resources to create a larger file mapping object, create a file mapping object that is the size of the file (set the dwMaximumSizeHigh and dwMaximumSizeLow parameters of CreateFileMapping both to zero) even if you do not expect to view the entire file. The cost in system resources comes in creating the views and accessing them.

So the usage of memory-mapped files is really cheap. If your program reads only portions of the file contents, skipping large parts of the file, you will also gain a big performance advantage, because only the parts of the file you actually access will be read (rounded to 16K pages).
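
For files too large to map in one piece (the windowed case mentioned above), here is a hedged sketch of mapping just a window of an existing file mapping; the helper name is hypothetical. Note that the file offset passed to MapViewOfFile must be a multiple of the system allocation granularity (typically 64 KB), so the offset is aligned down first:

#include <windows.h>

/* Map a read-only window of cbWindow bytes starting at byte 'offset' of an
   existing file mapping. On success, *ppViewBase receives the raw view base
   (pass that to UnmapViewOfFile later) and the returned pointer addresses
   'offset' itself. */
PBYTE MapFileWindow (HANDLE hFileMapping, ULONGLONG offset, SIZE_T cbWindow,
                     PBYTE *ppViewBase)
{
    SYSTEM_INFO si;
    ULONGLONG aligned;
    SIZE_T delta;

    GetSystemInfo (&si);   /* dwAllocationGranularity, typically 64 KB */
    aligned = offset - (offset % si.dwAllocationGranularity);
    delta = (SIZE_T)(offset - aligned);

    *ppViewBase = (PBYTE) MapViewOfFile (hFileMapping, FILE_MAP_READ,
                                         (DWORD)(aligned >> 32),
                                         (DWORD)(aligned & 0xFFFFFFFF),
                                         cbWindow + delta);
    return *ppViewBase ? *ppViewBase + delta : NULL;
}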

Cleaner code for the file mapping follows:

DWORD MapFileInMemory (LPCTSTR pszFileName,
                       PBYTE *ppbyFile,
                       PDWORD pdwFileSizeLow, OUT PDWORD pdwFileSizeHigh)
{
    HANDLE  hFile = INVALID_HANDLE_VALUE, hFileMapping = NULL;
    DWORD dwStatus = NO_ERROR;

    __try {
        hFile = CreateFile (pszFileName, FILE_READ_DATA, 0, NULL, OPEN_EXISTING,
                            FILE_ATTRIBUTE_NORMAL | FILE_FLAG_SEQUENTIAL_SCAN,
                            NULL);
        if (hFile == INVALID_HANDLE_VALUE) {
            dwStatus = GetLastError();
            __leave;
        }

        *pdwFileSizeLow = GetFileSize (hFile, pdwFileSizeHigh);
        if (*pdwFileSizeLow == INVALID_FILE_SIZE){
            dwStatus = GetLastError();
            __leave;
        }

        hFileMapping = CreateFileMapping (hFile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!hFileMapping){
            dwStatus = GetLastError();
            __leave;
        }

        *ppbyFile = (PBYTE) MapViewOfFile (hFileMapping, FILE_MAP_READ, 0, 0, 0);
        if (*ppbyFile == NULL) {
            dwStatus = GetLastError();
            __leave;
        }
    }
    __finally {
        /* Closing these handles here is safe: the view returned through
           *ppbyFile keeps the mapping alive until UnmapViewOfFile(). */
        if (hFileMapping) CloseHandle (hFileMapping);
        if (hFile != INVALID_HANDLE_VALUE) CloseHandle (hFile);
    }

    return dwStatus;
}

BOOL UnmapFileFromMemory (LPCVOID lpBaseAddress)
{
    return UnmapViewOfFile (lpBaseAddress);
}
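
A possible usage sketch of these two helpers (file name assumed; the 64-bit size is reassembled from the two DWORDs as shown earlier):

PBYTE pbyFile = NULL;
DWORD dwSizeLow = 0, dwSizeHigh = 0;
DWORD dwStatus = MapFileInMemory (TEXT("Z:\\test.dat"), &pbyFile,
                                  &dwSizeLow, &dwSizeHigh);
if (dwStatus == NO_ERROR)
{
    unsigned __int64 cbFile = ((unsigned __int64)dwSizeHigh << 32) + dwSizeLow;

    /* ... work with pbyFile[0 .. cbFile-1] as an ordinary memory block ... */

    UnmapFileFromMemory (pbyFile);
}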
Oleg
MMF has pros and cons - cons include: on 32-bit you have the problem of address space, which severely limits how big a file you can map (of course you can map "windows" into your file, but then you lose most of the MMF advantages). On both 32- and 64-bit, you must keep in mind that mapping requires additional memory for the page-table mappings. You have less control over caching. And you take a #PF hardware exception for each 4 KB page you access (not a problem on today's CPUs, but it's overhead nonetheless).
snemarch
@snemarch: The #PF hardware exception path is among the most optimized code in Windows - at least that is what MS says. The total cost of MMF is lower than that of standard input/output operations. If the file is very large, you of course need to map only a part of it at a time. Nevertheless, because MMF is the fastest way to access files under Windows, I would use MMF in this case as well. If you access files over the network, a simple experiment will show how large the performance advantage of MMF is. Just try it.
Oleg
Thank you for the detailed answer, but this is not what I'm asking for. I've already worked around the issue by reading the file sequentially in 16 MB chunks. But I'm still curious about the cause of the problem described. The problem is: if I call ReadFile() with a size argument greater than 16 MB, it fails with ERROR_INVALID_PARAMETER. The drive is not a network drive, it is a VirtualBox shared folder.
mephisto123
I don't know about any restriction in `ReadFile`, but your goal is not to use one particular API or another. As I understood it, you want to work with the full contents of the file as a block of memory. If you are writing the program for Windows, then the best implementation of that requirement is to use memory-mapped files. Just compare the performance of your program using any read-file function against the `MapFileInMemory` function I posted; you will see why I make the recommendation. When your program is started, or when it loads a DLL, exactly the same code path is used: memory mappings are created.
Oleg
@Oleg: yes, #PF is optimized and you won't see much of a CPU hit unless you run on really low-end hardware... but you do get a #PF for each page access, there will be more r3<>r0 transitions, and you aren't able to control buffering strategies. Whether MMF is the optimal access strategy depends on your usage patterns - if you read a lot from the same areas, being able to serve requests directly out of the buffer cache is nice. If you do huge one-pass data processing, it can be advantageous to use unbuffered ReadFile to avoid wasting memory on useless buffering.
snemarch
@snemarch: Sorry, but I disagree with you. In `CreateFile` you can define the buffering strategy through flags. Any disk operation is implemented via hardware interrupts; part of the work is done in user mode and part in kernel mode. #PF is the shortest and most optimized I/O path. If you don't believe it, just run a test. In the question, the file is read into a memory buffer so it can be used as a block of memory. For example, use the program from the question to calculate the MD5 of a file, then repeat with the code I posted. The results will give you the answer.
Oleg
I did a quick little test on 64bit Win7 w/8GB ram: mapping a ~3.2GB file makes the process' private bytes rise by around ~7MB. Touching the mapped memory makes commit charge rise by, surprise surprise, ~3.2GB. Finally, FILE_FLAG_NO_BUFFERING is only partially honored - if specified it *does* force re-reading, but if removed a second sum takes a few seconds, so obviously the file *is* cached.
snemarch
@snemarch: Sorry, but I cannot follow you. What connection do your experiments have to the question asked by mephisto123? What do you want to show with them? According to the link http://msdn.microsoft.com/en-us/library/aa366542.aspx, which I included in my answer, you can safely use up to "2 GB minus the virtual memory already reserved by the process". You can use file mapping effectively with very large files, use SEC_LARGE_PAGES, map only part of the file at once, and so on. But all of that is another question. Can you see that MMF works very well with a file of about 21 MB, like the one in **the question**?
Oleg
@oleg: all I'm trying to say is that MMF isn't the end-all, be-all solution to file access, and that it has some downsides. If you work with huge files on 32-bit systems and thus need windowed access, you lose one of the biggest benefits of MMF, "the file is just one memory area". SEC_LARGE_PAGES only works for pagefile-backed MMFs (i.e., not for file access). As for topic relevance, MMF isn't very relevant to what's probably a driver bug/limit in VirtualBox :) PS: 16K pages? That'd be 4K pages or 64K allocation-granularity size :)
snemarch
+3  A: 

The problem is not with ReadFile() itself. The real problem is that your while() loop is buggy to begin with. You are mismanaging the len and readed variables: on each iteration of the loop, you decrement len and reset readed. Eventually, len is decremented to a value that matches readed and the loop stops running. The fact that your "Success" message reports 16 MB is a coincidence, because you are modifying both variables while you read the file. len is initially set to 21 MB and counts down until _read() happens to return a 16 MB buffer when 16 MB was asked for. That does not mean that ReadFile() failed on a 16 MB read (if that were the case, the very first loop iteration would fail because it asks for a 21 MB read).

You need to fix your while() loop, not blame ReadFile(). The correct looping logic should look more like this instead:

int total = 0; 

while (total < len)
{ 
    readed = _read(fd, &buffer[total], len-total); 
    if (readed < 1) break;
    total += readed;
} 

_close(fd); 

if (total != len)
{ 
    printf("Failed to read file: %d out of %d, error %s\n", total, len, strerror(errno)); 
    ...
    return -4; 
} 

printf("Success (%u).",total); 
...
Remy Lebeau - TeamB
You did not read my question :( That loop isn't supposed to read the whole file in pieces. It is finding the maximal block size that ReadFile() reads without errors. First it tries to read the whole file in one chunk - 21 MB. ReadFile() returns an error. Then it tries a block 100 bytes smaller. ReadFile() still returns an error. And so on; it stops when ReadFile() returns success. And this is NOT a read error or anything like that... ReadFile() does not try to read the file at all, it just returns ERROR_INVALID_PARAMETER if the block size is more than 16 MB.
mephisto123
I agree with Hans then. When testing your code in a non-virtual XP SP2 environment with a 600 MB test file, the first call to _read() was able to return the entire 600 MB in a single read (well, 600 MB minus 100 bytes, since you decrement len before performing the first read) without any error. So it is definitely likely that your VirtualBox system is limiting the size you can use for the buffer.
Remy Lebeau - TeamB
Yes, I think so. But I still wonder why VirtualBox does this.
mephisto123
Does it really matter why? It is happening. ReadFile() (and consequently _read()) is allowed to return fewer bytes than requested, so you have to take the actual return value into account correctly when reading data.
Remy Lebeau - TeamB