
OK, I have been reading up on fread() (which returns a size_t) and saw several posts regarding large files and the issues others have been having, but I am still running into problems. This function is passed a file pointer and a long long int. The long long comes from main, where I use another function to get the actual file size, which is 6448619520 bytes.

char *getBuffer(FILE *fptr, long long size) {
    char *bfr;
    size_t result;

    printf("size of file in allocate buffer:  %lld\n", size);
        //size here is 6448619520


    bfr = (char*) malloc(sizeof(char) * size);
    if (bfr == NULL) {
        printf("Error, malloc failed..\n");
        exit(EXIT_FAILURE);
    }
        //positions fptr to offset location which is 0 here.
    fseek(fptr, 0, SEEK_SET);
        //read the entire input file into bfr
    result = fread(bfr, sizeof(char), size, fptr);
    printf("result = %lld\n",  (long long) result);


    if(result != size)
    {
        printf("File failed to read\n");
        exit(5);
    }
    return (bfr);

}

I have tested it on files of around 1-2 GB in size and it works fine; however, when I test it on a 6 GB file, nothing is read into the buffer. Ignore the other output (focus on the bolded results); the issue lies with reading the data into bfr. Here are some of the results I get.

First, a file that is 735844352 bytes (700+ MB):

root@redbox:/data/projects/C/stubs/# ./testrun -x 45004E00 -i /data/Helix2008R1.iso

Image file is /data/Helix2008R1.iso
hex string = 45004E00
Total size of file: 735844352
size of file in get buffer: 735844352
result = 735844352

** Begin parsing the command line hex value: 45004E00
Total number of bytes in hex string: 4

Results of hex string search:
Hex string 45004E00 was found at byte location: 37441
Hex string 45004E00 was found at byte location: 524768
....

Run #2 against a 6 GB file:

root@redbox:/data/projects/C/stubs/# ./testrun -x BF1B0650 -i /data/images/sixgbimage.img

Image file is /data/images/sixgbimage.img
hex string = BF1B0650
Total size of file: 6448619520
size of file in allocate buffer: 6448619520
result = 0
File failed to read

I am still not sure why it is failing with large files and not smaller ones. Is it a >4 GB issue? I am using the following:

/* Support Large File Use */
#define _LARGEFILE_SOURCE 1
#define _LARGEFILE64_SOURCE 1
#define _FILE_OFFSET_BITS   64

BTW, I am using an Ubuntu 9.10 box (2.6.x kernel). TIA.

+2  A: 

When fread fails, it sets errno to indicate why it failed. What is the value of errno after the call to fread that returns zero?

Update: Are you required to read the entire file in one fell swoop? What happens if you read in the file, say, 512MB at a time?

According to your comment above, you are using a 32-bit OS. In that case, you will be unable to handle 6 GB at a time (for one, size_t won't be able to hold that large of a number). You should, however, be able to read in and process the file in smaller chunks.
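A minimal sketch of that chunked approach (the function name, macro, and chunk size below are placeholders of mine, not from the original program):

```c
#include <stdio.h>

#define CHUNK_SIZE (64 * 1024)   /* 64 KB per read; tune to taste */

/* Read `path` chunk by chunk, processing each buffer as it arrives.
 * Returns the total number of bytes read, or -1 if the file cannot
 * be opened. */
long long process_in_chunks(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    unsigned char buf[CHUNK_SIZE];
    long long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* ... scan/process the n valid bytes in buf here ... */
        total += (long long) n;
    }

    fclose(fp);
    return total;
}
```

Because only CHUNK_SIZE bytes live in memory at once, this works the same on a 700 MB file and a 6 GB file, 32-bit process or not.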

I would argue that reading a 6GB file into memory is probably not the best solution to your problem even on a 64-bit OS. What exactly are you trying to accomplish that is requiring you to buffer a 6GB file? There's probably a better way to approach the problem.

bta
After fread() returns 0, the errno value is 22 (EINVAL). Results:
Total size of file: 6448619520
size of file in get buffer: 6448619520
result = 0
errno = 22
File failed to read
jdd
Reading it in one sector at a time would be fine. It's just that I have gotten used to working with small test files, so when I got around to testing on bigger files, I did not take into account the memory I would not have available... ty.
jdd
Your question is a good one; I should have realized early on that reading something that large into memory would be less than feasible and inefficient. So yes, a small redesign is in order: break the reads up into small chunks and process each chunk until I reach the end of the file. The overall reason for this program is to grab hex offset patterns within dd images or VMFS partitions. Some VMFS partitions tend to get extremely large, such as 200+ GB or bigger. Again, not the best way to approach it; I see that now.
jdd
+4  A: 

If you're just going to be reading through the file, not modifying it, I suggest using mmap(2) instead of fread(3). This should be much more efficient, though I haven't tried it on huge files. You'll need to change my very simplistic found/not found to report offsets if that is what you would rather have, but I'm not sure what you want the pointer for. :)

#define _GNU_SOURCE
#include <string.h>

#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>


int main(int argc, char* argv[]) {
    char *base, *found;
    off_t len;
    struct stat sb;
    int ret;
    int fd;
    unsigned int needle = 0x45004E00;

    ret = stat(argv[1], &sb);
    if (ret) {
            perror("stat");
            return 1;
    }

    len = sb.st_size;

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
            perror("open");
            return 1;
    }

    base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) {
            perror("mmap");
            return 1;
    }

    found = memmem(base, len, &needle, sizeof(unsigned int));
    if (found)
            printf("Found %X at %p\n", needle, found);
    else
            printf("Not found\n");
    return 0;
}

Some tests:

$ ./mmap ./mmap
Found 45004E00 at 0x7f8c4c13a6c0
$ ./mmap /etc/passwd
Not found
sarnold
I have always used malloc before, never used mmap before...saw a good explanation on the two here: http://stackoverflow.com/questions/1739296/malloc-vs-mmap-in-c
jdd
From my point of view, using `fread` into such a large buffer is an abuse. Even if you want to change the buffer after reading, you should use `mmap`: passing `MAP_PRIVATE` in the flags for `mmap` ensures that all changes to your copy stay in memory and are not written back to the file. `mmap` is much more efficient since it doesn't need to swap out any pages of `bfr` as long as you don't modify them; all pages are handled directly in the page cache. If you really don't want to change the contents, map it read-only; then even several instances of your program could coexist using the same physical pages.
Jens Gustedt
+4  A: 

If this is a 32 bit process, as you say, then size_t is 32 bit and you simply cannot store more than 4GB in your process's address space (actually, in practice, a bit less than 3GB). In this line here:

bfr = (char*) malloc(sizeof(char) * size);

The result of the multiplication will be reduced modulo SIZE_MAX + 1, which means it'll only try and allocate around 2GB. Similarly, the same thing happens to the size parameter in this line:

result = fread(bfr, sizeof(char), size, fptr);

If you wish to work with large files in a 32 bit process, you have to work on only a part of them at a time (eg. read the first 100 MB, process that, read the next 100 MB, ...). You can't read the entire file in one go - there just isn't enough memory available to your process to do that.
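To make that silent truncation visible before it bites, one could guard the allocation explicitly. This is just a sketch, and `checked_alloc` is a name I made up:

```c
#include <stdint.h>
#include <stdlib.h>

/* Return a buffer of `size` bytes, or NULL if `size` is negative or
 * would not fit in this process's size_t (in which case malloc would
 * silently receive the value reduced modulo SIZE_MAX + 1). */
char *checked_alloc(long long size)
{
    if (size < 0)
        return NULL;
    if ((unsigned long long) size > (unsigned long long) SIZE_MAX)
        return NULL;   /* on a 32-bit build, 6448619520 lands here */
    return malloc((size_t) size);
}
```

On a 32-bit build the 6448619520-byte request would fail cleanly instead of allocating ~2 GB behind your back.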

caf
Good point, and that is what I was most curious about. Looks like I will have to read x MB per read until I reach the end of the file.
jdd
A: 

Have you verified that malloc and fread are actually taking in the right type of parameters? You may want to compile with the -Wall option and check if your 64-bit values are actually being truncated. In this case, malloc won't report an error but would end up allocating far less than what you had asked for.

casablanca
A: 

After taking the advice of everyone, I broke the 6 GB file up into 4K chunks, parsed the hex bytes, and was able to get the byte locations, which will help me later when I pull the MBR out of a VMFS partition that has been dd-imaged. Here is the quick and dirty way of reading it per chunk:

#define DEFAULT_BLOCKSIZE 4096
...

while((bytes_read = fread(chunk, sizeof(unsigned char), sizeof(chunk), fptr)) > 0) {
    chunkptr = chunk;
    for(z = 0; z < bytes_read; z++) {
        if (*chunkptr == pattern_buffer[current_search]) {
            current_search++;
            if (current_search > (counter - 1)) {
                current_search = 0;
                printf("Hex string %s was found at starting byte location:  %lld\n",
                       hexstring, (long long int) (offsetctr-1));
                matches++;
            }
        } else {
            current_search = 0;
        }
        chunkptr++;
        //printf("[%lld]: %02X\n", offsetctr, chunk[z] & 0xff);
        offsetctr++;
    }
    master_counter += bytes_read;
}

...

and here were the results I got...

root@redbox:~/workspace/bytelocator/Debug# ./bytelocator -x BF1B0650 -i /data/images/sixgbimage.img 

Total size of /data/images/sixgbimage.img file:  6448619520 bytes
Parsing the hex string now: BF1B0650

Hex string BF1B0650 was found at starting byte location:  18
Hex string BF1B0650 was found at starting byte location:  193885738
Hex string BF1B0650 was found at starting byte location:  194514442
Hex string BF1B0650 was found at starting byte location:  525033370
Hex string BF1B0650 was found at starting byte location:  1696715251
Hex string BF1B0650 was found at starting byte location:  1774337550
Hex string BF1B0650 was found at starting byte location:  2758859834
Hex string BF1B0650 was found at starting byte location:  3484416018
Hex string BF1B0650 was found at starting byte location:  3909721614
Hex string BF1B0650 was found at starting byte location:  3999533674
Hex string BF1B0650 was found at starting byte location:  4018701866
Hex string BF1B0650 was found at starting byte location:  4077977098
Hex string BF1B0650 was found at starting byte location:  4098838010


Quick stats:
================
Number of bytes that have been read:  6448619520
Number of signature matches found:  13
Total number of bytes in hex string:  4
jdd
I think your new program will miss copies of your byte-string that span two blocks: say, `BF1B` in block 0, `0650` in block 1.
sarnold
Yes, you are correct. That is one of the limitations for now: the search only finds patterns within each 4K block, not patterns that span two contiguous blocks. That's a must-fix!
jdd
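One way to close that gap, sketched here with invented names (`scan_with_overlap`, `report` are mine, not from the program above), is to carry the last `pat_len - 1` bytes of each chunk over to the front of the next read, so a match that straddles a chunk boundary is still seen:

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 4096

/* Scan `fp` for `pat` (pat_len bytes, assumed <= 64), reporting absolute
 * file offsets even when a match straddles a chunk boundary. Returns the
 * number of matches found. `report` may be NULL to just count. */
long long scan_with_overlap(FILE *fp, const unsigned char *pat, size_t pat_len,
                            void (*report)(long long offset))
{
    unsigned char buf[CHUNK + 64];   /* room for the carried-over tail */
    size_t keep = 0;                 /* bytes carried from the previous chunk */
    long long base = 0;              /* file offset of buf[0] */
    long long matches = 0;
    size_t n;

    while ((n = fread(buf + keep, 1, CHUNK, fp)) > 0) {
        size_t have = keep + n;
        for (size_t i = 0; i + pat_len <= have; i++) {
            if (memcmp(buf + i, pat, pat_len) == 0) {
                matches++;
                if (report)
                    report(base + (long long) i);
            }
        }
        /* keep the last pat_len - 1 bytes so a boundary-spanning match
         * is completed on the next iteration */
        keep = pat_len > 1 ? pat_len - 1 : 0;
        if (keep > have)
            keep = have;
        memmove(buf, buf + have - keep, keep);
        base += (long long) (have - keep);
    }
    return matches;
}
```

A match can never fit entirely inside the carried tail (it is one byte shorter than the pattern), so nothing is double-counted.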