Problem Description
I need to stream large files from disk. Assume the files are too large to fit in memory. Furthermore, suppose that I'm doing some calculation on the data and that the result is small enough to fit in memory. As a hypothetical example, suppose I need to calculate the md5sum of a 200GB file, and I need to do so with guarantees about how much RAM will be used.
In summary:
- Needs to be constant space
- As fast as possible
- Assume very large files
- Result fits in memory
Question
What are the fastest ways to read/stream data from a file using constant space?
Ideas I've had
If the file were small enough to fit in memory, then mmap on POSIX systems would be very fast; unfortunately, that's not the case here. Is there any performance advantage to using mmap with a small window size to map successive chunks of the file? Would the system call overhead of moving the mmap window down the file dominate any advantages? Or should I use a fixed buffer that I read into with fread?
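To make the two candidates concrete, here is a minimal sketch of both, assuming a POSIX system. The names sum_fread, sum_mmap, and fold are hypothetical, and fold is only a stand-in running byte sum in place of a real digest such as MD5. Both functions hold at most one buffer or one mapped window at a time, so memory use is bounded by the chosen size regardless of file length. On 32-bit platforms a 200GB file would also need large-file support (for example, compiling with -D_FILE_OFFSET_BITS=64 on glibc).

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stand-in for the real computation (e.g. an MD5 update): a running byte sum. */
static uint64_t fold(uint64_t acc, const unsigned char *p, size_t n) {
    for (size_t i = 0; i < n; i++) acc += p[i];
    return acc;
}

/* Approach 1: one fixed heap buffer, refilled with fread. Returns 0 on success. */
int sum_fread(const char *path, size_t bufsize, uint64_t *out) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    unsigned char *buf = malloc(bufsize);
    if (!buf) { fclose(f); return -1; }
    setvbuf(f, NULL, _IONBF, 0); /* skip stdio's own buffer; ours is the only copy */
    uint64_t acc = 0;
    size_t n;
    while ((n = fread(buf, 1, bufsize, f)) > 0)
        acc = fold(acc, buf, n);
    int err = ferror(f);
    free(buf);
    fclose(f);
    if (err) return -1;
    *out = acc;
    return 0;
}

/* Approach 2: map one window at a time and slide it down the file. */
int sum_mmap(const char *path, size_t window, uint64_t *out) {
    if (window == 0) return -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    /* mmap offsets must be page-aligned, so round the window up to a page multiple. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    window = (window + page - 1) / page * page;
    uint64_t acc = 0;
    for (off_t off = 0; off < st.st_size; off += (off_t)window) {
        off_t left = st.st_size - off;
        size_t len = left < (off_t)window ? (size_t)left : window;
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);
        if (p == MAP_FAILED) { close(fd); return -1; }
        acc = fold(acc, p, len);
        munmap(p, len); /* release the window before mapping the next one */
    }
    close(fd);
    *out = acc;
    return 0;
}
```

The fread version costs one read per buffer plus a copy from the page cache into the buffer; the mmap version costs an mmap/munmap pair per window plus page faults as the window is touched. Which one wins likely depends on the window size, the kernel, and the storage, so timing both with a few sizes on the target machine is probably the only reliable answer.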