views:

143

answers:

3

Problem Description

I need to stream large files from disk. Assume the files are larger than will fit in memory. Furthermore, suppose that I'm doing some calculation on the data and the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate an md5sum of a 200GB file and I need to do so with guarantees about how much ram will be used.

In summary:

  • Needs to be constant space
  • Fast as possible
  • Assume very large files
  • Result fits in memory

Question

What are the fastest ways to read/stream data from a file using constant space?

Ideas I've had

If the file was small enough to fit in memory, then mmap on POSIX systems would be very fast, unfortunately that's not the case here. Is there any performance advantage to using mmap with a small buffer size to buffer successive chunks of the file? Would the system call overhead of moving the mmap buffer down the file dominate any advantages Or should I use a fixed buffer that I read into with fread?

A: 

Depending on the language you are using, a C-like fread() loop based on a file for which you declared a particular buffer size will require exactly this buffer size, no more no less.

We typically choose a buffer size of 4 to 128 kBytes, there is little gain if any with bigger buffers.

If performance was extremely important, for relatively little gain (and at the risk of re-inventing something), one could consider using a two-thread implementation, whereby one thread reads the file in a set of two buffers, and the other thread perform calculations sequential fashion in one of the buffers at a time. In this fashion the disk access delays can be removed.

mjv
+3  A: 

I wouldn't be so sure that mmap would be very fast (where very fast is defined as significantly faster than fread).

Grep used to use mmap, but switched back to fread. One of the reasons was stability (strange things happen with mmap if the file shrinks whilst it is mapped or an IO error occurs). This page discusses some of the history about that.

You can compare the performance on your system with the option --mmap to grep. On my system the difference in performance on a 200GB file is negligible, but your mileage might vary!

In short, I'd use fread with a fixed size buffer. It's simpler to code, easier to handle errors and will almost certainly be fast enough.

Jeff Foster
A: 

mjv is right. You can use double-buffers and overlapped I/O. That way your crunching and the disk reading can be happening at the same time. Then I would profile or stack-shot the crunching to make it as fast as possible. With luck it will be faster than the I/O, so you will end up running the I/O at top speed without pause. Then things like file fragmentation come into the picture.

Mike Dunlavey