
OK, I have been reading up on fread() (which returns a size_t) and saw several posts regarding large files and the issues others have been having, but I am still running into problems. This function is passed a file pointer and a long long int. The long long comes from main, where I use another function to get the actual file size, which is 6448619520 bytes.

char *getBuffer(FILE *fptr, long long size) {
    char *bfr;
    size_t result;

    printf("size of file in allocate buffer:  %lld\n", size);
        //size here is 6448619520


    bfr = (char*) malloc(sizeof(char) * size);
    if (bfr == NULL) {
        printf("Error, malloc failed..\n");
        exit(EXIT_FAILURE);
    }
        //positions fptr to offset location which is 0 here.
    fseek(fptr, 0, SEEK_SET);
        //read the entire input file into bfr
    result = fread(bfr, sizeof(char), size, fptr);
    printf("result = %lld\n",  (long long) result);


    if(result != size)
    {
        printf("File failed to read\n");
        exit(5);
    }
    return (bfr);

}

I have tested it on files of around 1-2 GB in size and it works fine; however, when I test it on a 6 GB file, nothing is read into the buffer. Ignore the other output (focus on the bolded results); the issue lies with reading the data into bfr. Here are some of the results I get.

First, a file that is 735844352 bytes (700+ MB):

root@redbox:/data/projects/C/stubs/# ./testrun -x 45004E00 -i /data/Helix2008R1.iso

Image file is /data/Helix2008R1.iso
hex string = 45004E00
Total size of file: 735844352
size of file in get buffer: 735844352
result = 735844352

** Begin parsing the command line hex value: 45004E00
Total number of bytes in hex string: 4

Results of hex string search:
Hex string 45004E00 was found at byte location: 37441
Hex string 45004E00 was found at byte location: 524768
....

Run #2 against a 6 GB file:

root@redbox:/data/projects/C/stubs/# ./testrun -x BF1B0650 -i /data/images/sixgbimage.img

Image file is /data/images/sixgbimage.img
hex string = BF1B0650
Total size of file: 6448619520
size of file in allocate buffer: 6448619520
result = 0
File failed to read

I am still not sure why it is failing with large files and not smaller ones. Is it a >4 GB issue? I am using the following:

/* Support Large File Use */
#define _LARGEFILE_SOURCE 1
#define _LARGEFILE64_SOURCE 1
#define _FILE_OFFSET_BITS   64

BTW, I am using an Ubuntu 9.10 box (2.6.x kernel). TIA.

+2  A: 

When fread fails, it sets errno to indicate why it failed. What is the value of errno after the call to fread that returns zero?

Update: Are you required to read the entire file in one fell swoop? What happens if you read in the file, say, 512MB at a time?

According to your comment above, you are using a 32-bit OS. In that case, you will be unable to handle 6 GB at a time (for one, size_t won't be able to hold that large of a number). You should, however, be able to read in and process the file in smaller chunks.
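A minimal sketch of that chunked approach (the function name, macro, and chunk size below are placeholders of mine, not from the original program):

```c
#include <stdio.h>

#define CHUNK_SIZE (64 * 1024)   /* 64 KB per read; tune to taste */

/* Read `path` chunk by chunk, processing each buffer as it arrives.
 * Returns the total number of bytes read, or -1 if the file cannot
 * be opened. */
long long process_in_chunks(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    unsigned char buf[CHUNK_SIZE];
    long long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* ... scan/process the n valid bytes in buf here ... */
        total += (long long) n;
    }

    fclose(fp);
    return total;
}
```

Because only CHUNK_SIZE bytes live in memory at once, this works the same on a 700 MB file and a 6 GB file, 32-bit process or not.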

I would argue that reading a 6GB file into memory is probably not the best solution to your problem even on a 64-bit OS. What exactly are you trying to accomplish that is requiring you to buffer a 6GB file? There's probably a better way to approach the problem.

bta
After fread() returns 0, the errno value is 22 (EINVAL). Results:
Total size of file: 6448619520
size of file in get buffer: 6448619520
result = 0
errno = 22
File failed to read
jdd
Reading it in one sector at a time would be fine. It's just that I have gotten used to working with small test files, so when I got around to testing on bigger files, I did not take into account the memory I would not have available... ty.
jdd
Your question is a good one; I should have realized early on that reading something that large into memory would be less than feasible and inefficient. So yes, a small redesign is in order: break the reads up into small chunks and process each chunk until I reach the end of the file. The overall reason for this program is to grab hex offset patterns within dd images or VMFS partitions. Some VMFS partitions tend to get extremely large, such as 200+ GB or bigger. Again, not the best way to approach it; I see that now.
jdd
+4  A: 

If you're just going to be reading through the file, not modifying it, I suggest using mmap(2) instead of fread(3). This should be much more efficient, though I haven't tried it on huge files. You'll need to change my very simplistic found/not found to report offsets if that is what you would rather have, but I'm not sure what you want the pointer for. :)

#define _GNU_SOURCE
#include <string.h>

#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>


int main(int argc, char* argv[]) {
    char *base, *found;
    off_t len;
    struct stat sb;
    int ret;
    int fd;
    unsigned int needle = 0x45004E00;

    ret = stat(argv[1], &sb);
    if (ret) {
            perror("stat");
            return 1;
    }

    len = sb.st_size;

    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
            perror("open");
            return 1;
    }

    base = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) {
            perror("mmap");
            return 1;
    }

    found = memmem(base, len, &needle, sizeof(unsigned int));
    if (found)
            printf("Found %X at %p\n", needle, found);
    else
            printf("Not found\n");
    return 0;
}

Some tests:

$ ./mmap ./mmap
Found 45004E00 at 0x7f8c4c13a6c0
$ ./mmap /etc/passwd
Not found
sarnold
I have always used malloc before, never used mmap before...saw a good explanation on the two here: http://stackoverflow.com/questions/1739296/malloc-vs-mmap-in-c
jdd
From my point of view, using `fread` into such a large buffer is an abuse. Even if you want to change the buffer after reading, you should use `mmap`: passing `MAP_PRIVATE` in the flags for `mmap` ensures that all changes to your copy stay in memory and are not written back to the file. `mmap` is much more efficient since it doesn't need to swap out any pages of `bfr` as long as you don't modify them; all pages are handled directly in the page cache. If you really don't want to change the contents, map it read-only; then even several instances of your program could coexist using the same physical pages.
Jens Gustedt
+4  A: 

If this is a 32 bit process, as you say, then size_t is 32 bit and you simply cannot store more than 4GB in your process's address space (actually, in practice, a bit less than 3GB). In this line here:

bfr = (char*) malloc(sizeof(char) * size);

The result of the multiplication will be reduced modulo SIZE_MAX + 1, which means it'll only try and allocate around 2GB. Similarly, the same thing happens to the size parameter in this line:

result = fread(bfr, sizeof(char), size, fptr);

If you wish to work with large files in a 32 bit process, you have to work on only a part of them at a time (eg. read the first 100 MB, process that, read the next 100 MB, ...). You can't read the entire file in one go - there just isn't enough memory available to your process to do that.
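To make that silent truncation visible before it bites, one could guard the allocation explicitly. This is just a sketch, and `checked_alloc` is a name I made up:

```c
#include <stdint.h>
#include <stdlib.h>

/* Return a buffer of `size` bytes, or NULL if `size` is negative or
 * would not fit in this process's size_t (in which case malloc would
 * silently receive the value reduced modulo SIZE_MAX + 1). */
char *checked_alloc(long long size)
{
    if (size < 0)
        return NULL;
    if ((unsigned long long) size > (unsigned long long) SIZE_MAX)
        return NULL;   /* on a 32-bit build, 6448619520 lands here */
    return malloc((size_t) size);
}
```

On a 32-bit build the 6448619520-byte request would fail cleanly instead of allocating ~2 GB behind your back.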

caf
Good point, and that is what I was most curious about. Looks like I will have to read x MB per read until I reach the end of the file.
jdd
A: 

Have you verified that malloc and fread are actually taking in the right type of parameters? You may want to compile with the -Wall option and check if your 64-bit values are actually being truncated. In this case, malloc won't report an error but would end up allocating far less than what you had asked for.

casablanca
A: 

After taking the advice of everyone, I broke the 6 GB file up into 4K chunks, parsed the hex bytes, and was able to get the byte locations, which will help me later when I pull the MBR out of a VMFS partition that has been dd-imaged. Here is the quick and dirty way of reading it per chunk:

#define DEFAULT_BLOCKSIZE 4096
...

while((bytes_read = fread(chunk, sizeof(unsigned char), sizeof(chunk), fptr)) > 0) {
    chunkptr = chunk;
    for(z = 0; z < bytes_read; z++) {
        if (*chunkptr == pattern_buffer[current_search]) {
            current_search++;
            if (current_search > (counter - 1)) {
                current_search = 0;
                printf("Hex string %s was found at starting byte location:  %lld\n",
                       hexstring, (long long int) (offsetctr-1));
                matches++;
            }
        } else {
            current_search = 0;
        }
        chunkptr++;
        //printf("[%lld]: %02X\n", offsetctr, chunk[z] & 0xff);
        offsetctr++;
    }
    master_counter += bytes_read;
}

...

and here were the results I got...

root@redbox:~/workspace/bytelocator/Debug# ./bytelocator -x BF1B0650 -i /data/images/sixgbimage.img 

Total size of /data/images/sixgbimage.img file:  6448619520 bytes
Parsing the hex string now: BF1B0650

Hex string BF1B0650 was found at starting byte location:  18
Hex string BF1B0650 was found at starting byte location:  193885738
Hex string BF1B0650 was found at starting byte location:  194514442
Hex string BF1B0650 was found at starting byte location:  525033370
Hex string BF1B0650 was found at starting byte location:  1696715251
Hex string BF1B0650 was found at starting byte location:  1774337550
Hex string BF1B0650 was found at starting byte location:  2758859834
Hex string BF1B0650 was found at starting byte location:  3484416018
Hex string BF1B0650 was found at starting byte location:  3909721614
Hex string BF1B0650 was found at starting byte location:  3999533674
Hex string BF1B0650 was found at starting byte location:  4018701866
Hex string BF1B0650 was found at starting byte location:  4077977098
Hex string BF1B0650 was found at starting byte location:  4098838010


Quick stats:
================
Number of bytes that have been read:  6448619520
Number of signature matches found:  13
Total number of bytes in hex string:  4
jdd
I think your new program will miss copies of your byte-string that span two blocks: say, `BF1B` in block 0, `0650` in block 1.
sarnold
Yes, you are correct. That is one of the limitations for now: the search only finds patterns within each 4K block, not patterns that span two contiguous blocks. That's a must-fix!
jdd
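One way to close that gap, sketched here with invented names (`scan_with_overlap`, `report` are mine, not from the program above), is to carry the last `pat_len - 1` bytes of each chunk over to the front of the next read, so a match that straddles a chunk boundary is still seen:

```c
#include <stdio.h>
#include <string.h>

#define CHUNK 4096

/* Scan `fp` for `pat` (pat_len bytes, assumed <= 64), reporting absolute
 * file offsets even when a match straddles a chunk boundary. Returns the
 * number of matches found. `report` may be NULL to just count. */
long long scan_with_overlap(FILE *fp, const unsigned char *pat, size_t pat_len,
                            void (*report)(long long offset))
{
    unsigned char buf[CHUNK + 64];   /* room for the carried-over tail */
    size_t keep = 0;                 /* bytes carried from the previous chunk */
    long long base = 0;              /* file offset of buf[0] */
    long long matches = 0;
    size_t n;

    while ((n = fread(buf + keep, 1, CHUNK, fp)) > 0) {
        size_t have = keep + n;
        for (size_t i = 0; i + pat_len <= have; i++) {
            if (memcmp(buf + i, pat, pat_len) == 0) {
                matches++;
                if (report)
                    report(base + (long long) i);
            }
        }
        /* keep the last pat_len - 1 bytes so a boundary-spanning match
         * is completed on the next iteration */
        keep = pat_len > 1 ? pat_len - 1 : 0;
        if (keep > have)
            keep = have;
        memmove(buf, buf + have - keep, keep);
        base += (long long) (have - keep);
    }
    return matches;
}
```

A match can never fit entirely inside the carried tail (it is one byte shorter than the pattern), so nothing is double-counted.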