tags:

views:

525

answers:

7

What is the fastest way to read every 30th byte of a large binary file (2-3 GB)? I've read there are performance problems with fseek because of I/O buffers, but I don't want to read 2-3 GB of data into memory before grabbing every 30th byte either.

+22  A: 

What I'd suggest is that you create a buffer of a few thousand bytes, read every 30th byte from it, reload the buffer with the next few thousand bytes, and continue until you reach the eof. That way the amount of data read into memory is limited, and you also don't have to read from the file as often. You'll find that the larger the buffer you create, the faster it'll be.

Edit: Actually, as suggested below, you'll probably want to make your buffer a few hundred kb's, not a few thousand bytes (like I said - bigger buffer = faster file read).

Cam
+1 -- was just writing almost exactly the same thing -- except I recommended a few hundred kilobytes per chunk.
Jerry Coffin
Yeah, that's probably better. I mean if the file is that large, he's obviously in an environment where he can afford a buffer bigger than a few thousand bytes :) (edited answer)
Cam
Bigger buffer = fewer reads = speed.
Steven Sudit
I predict that compared with the default buffering strategy used in the standard I/O library, the benefits of this scheme won't even be measurable (for a program reading every 30th byte). I would be pleased to see measurements proving me wrong.
Norman Ramsey
@Norman Ramsey: I predict otherwise. Test currently running, I'll post a CW answer shortly.
Steve Jessop
+10  A: 

Well, you can read a byte and then seek 29 bytes in a loop. But the IO subsystem has to read from the file by sectors, which are typically 512 bytes in size, so it will still end up reading the whole file.

In the long run, it will be faster to just read the whole file in chunks that are a multiple of your step size, and then just look in the buffer. You'll make your life a bit simpler if you make sure that you buffer size is a multiple of 30, and you make the fileio subsystem's life easier if it's a multiple of 512.

while (still more file to read)
{ 
   char buf[30 * 512];
   int cread = fread (buf, sizeof(buf), 1, fd);
   for (int ii = 0; ii < cread; ii += 30)
   {

   }
}

This may look inefficient, but it will work out to be faster than trying to read in 30 byte chunks.

By the way. If you are running on Windows, and willing to be OS specific, you really can't beat the performance of memory mapped files. http://stackoverflow.com/questions/2171625/how-to-scan-through-really-huge-files-on-disk/2234662#2234662

John Knoeller
It's an important point that the sector size means that the OS will be reading the entire file regardless.
caf
Windows isn't the only platform with memory-mapped files, of course.
Ken
@Ken: I have no first hand knowledge of how mmap performs relative to fread, and the sample code I link to is Windows only.
John Knoeller
+9  A: 

If you're willing to break out of ANSI-C and use OS specific calls, I'd recommend using memory mapped files. This is the Posix version (Windows has it's own OS specific calls):

#define MAPSIZE 4096
int fd = open(file, O_RDONLY);
struct stat stbuf;
fstat(fd, &stbuf);


char *addr = 0;
off_t last_mapped_offset = -1;
off_t idx = 0;
while (idx < stbuf.st_size)
{
    if (last_mapped_offset != (idx / MAPSIZE))
    {
        if (addr)
            munmap(addr, MAPSIZE);

        last_mapped_offset = idx / MAPSIZE; 

        addr = mmmap(0, MAPSIZE, PROT_READ, MAP_FILE, fd, idx, last_mapped_offset);
    }

    *(addr + (idx % MAPSIZE));

    idx += 30;

}

munmap(addr, MAPSIZE);
close(fd);
R Samuel Klatchko
Would typical POSIX-based OSes still perform read-ahead when you only `mmap()` one page at a time and never call `madvise()`?
bk1e
By the way, `mmap()` uses `SIGBUS` to report errors that occur after the file is mapped. This is much harder to deal with correctly than errors from `read()` or `fread()`.
bk1e
+1  A: 

You almost certainly don't need to worry about it. The runtime may well buffer the last block that it read for each file handle. Even if it doesn't, the operating system is caching file accesses for you.

That said, if you read a block at a time, you do save on call overheads to the fseek and fread functions. The bigger the block you read at once, the more you save on call overheads - though other costs obviously start making themselves felt beyond a certain point.

Steve314
+3  A: 
Norman Ramsey
Prepare to be shocked, fseek causes a big slowdown on my machine (NTFS, Windows XP, cygwin).
Steve Jessop
@Steve: cool! Compared with what???
Norman Ramsey
@Steve: I'm pretty skeptical about cygwin. I'd love to know how performance compares with the Microsoft C compiler and library (identical code).
Norman Ramsey
+10  A: 

Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness, (b) desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.

#include <stdio.h>
#include <stdlib.h>

const long long size = 1024LL*1024*MEGS;
const int step = 32;

int main() {
    FILE *in = fopen("/cygdrive/c/rand1.data", "rb");
    int total = 0;
    #if SEEK
        long long i = 0;
        char buf[1];
        while (i < size) {
            fread(buf, 1, 1, in);
            total += (unsigned char) buf[0];
            fseek(in, step - 1, SEEK_CUR);
            i += step;
        }
    #endif
    #ifdef BUFSZ
        long long i = 0;
        char buf[BUFSZ];
        while (i < size) {
            fread(buf, BUFSZ, 1, in);
            i += BUFSZ;
            for (int j = 0; j < BUFSZ; j += step) 
                total += (unsigned char) buf[j];
        }
    #endif
    printf("%d\n", total);
}

Results:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m1.391s
user    0m0.030s
sys     0m0.030s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.172s
user    0m0.108s
sys     0m0.046s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real    0m0.031s
user    0m0.030s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real    0m0.141s
user    0m0.140s
sys     0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2
83595817

real    0m20.797s
user    0m1.733s
sys     0m9.140s

Summary:

I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.

At this point I'm pulling fseek out of the ring so the other two can continue:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m33.437s
user    0m0.749s
sys     0m1.562s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.078s
user    0m5.030s
sys     0m0.484s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.141s
user    0m0.280s
sys     0m0.500s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real    0m6.094s
user    0m4.968s
sys     0m0.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real    0m1.140s
user    0m0.171s
sys     0m0.640s

1000MB of data also appears to be substantially cached. A 32KB buffer is 6 times faster than a 32 byte buffer. But the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m25.515s
user    0m5.155s
sys     0m12.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2
-938074821

real    3m59.015s
user    1m11.061s
sys     0m10.999s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real    3m42.423s
user    0m5.577s
sys     0m14.484s

Ignore the first of those three, it benefited from the first 1000MB of the file already being in RAM.

Now, the version with the 32KB is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32 byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.

Moral of the story: standard I/O buffering doesn't cut it in my implementation, the performance of fseek is atrocious as the questioner says. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU was busier.

incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done a job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead, not standard I/O buffering, because if it was the latter then fseek would be better than it is. Actually, it could be that the standard I/O is doing the read ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).

Steve Jessop
Very cool. But `fread` is not optimized for 1 char. Can you try `fgetc`?
Norman Ramsey
fgetc vs. fread makes no difference that I can detect in 4 test runs of each (with MEGS=20, data pre-loaded). Range of results is 19.4s to 21.2s, with the best and worst both using fgetc. I expect other people's mileage to vary - I don't know to what extent cygwin+gcc is using unmodified glibc, and I don't know whether there's some peculiarity of Windows responsible for the performance hit on fseek. You'd think that a forward seek of 31 bytes "should" most of the time just increment an offset in the FILE*, but apparently not.
Steve Jessop
@Steve: I traced it; the sucker makes a system call on every `fseek`. What idiots! I changed your program to use Phong Vo's sfio library, and at that point the differences are still there but they are reasonably small. Thanks for posting such a useful program. Oh, and +1 :-)
Norman Ramsey
Thanks, Norman. Number 1 rule of performance questions: it's usually really easy to write a half-assed benchmark, and a half-assed benchmark is usually enough to reveal serious performance disasters :-)
Steve Jessop
A: 

If you are reading data from a hard disk with a spinning platter the answer is you read the whole file sequentially using a large buffer and discard the portions in memory you don't want.

The smallest unit of access possible to a standard hard disk drive is the sector. Sector sizes for all common spinning disk drives are many times more than 30 bytes. This means the hard disk controller must access each and every sector anyway regardless of what the request from the host looks like. There is no low level magic possible to change this.

Even if this was not the case and you could read individual bytes there is a huge premium for seek vs sequential read operations. The best possible case is still the same as sequential read. In the real world I wouldn't be surprised if signaling overhead would preclude such schemes from working even with a massive command buffer.

Einstein