Fast file I/O is less about the specific API calls you make than about how you architect your application around I/O.
If you are performing all of your I/O operations on a single thread in a sequential manner, for example:
- Read block into memory
- Process block in memory somehow
- Write block out to file
- Repeat until done...
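Concretely, the serial loop above looks something like this (a minimal sketch; the placeholder `process` function and `BLOCK_SIZE` are hypothetical stand-ins for your real work and tuning):

```python
BLOCK_SIZE = 64 * 1024  # 64 KiB blocks; tune for your workload

def process(block: bytes) -> bytes:
    """Placeholder for whatever per-block work you actually do."""
    return block.upper()

def run_serial(src_path: str, dst_path: str) -> None:
    with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        while True:
            block = fin.read(BLOCK_SIZE)   # 1. read block into memory
            if not block:                  # 4. repeat until done
                break
            fout.write(process(block))     # 2. process, 3. write out
```

While `process` runs, the disk does nothing; while the read and write run, the CPU mostly waits.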
then you are bottlenecking throughput on a single thread: the disk sits idle while you process, and the CPU sits idle while you read or write. An alternative, more complicated, design is to multithread your application to maximize throughput and minimize wait time. This lets the system take advantage of both CPU and I/O controller bandwidth simultaneously. A typical design would look something like:
- One (or more) worker threads read data from disk and add them to a shared input queue
- One (or more) worker threads read blocks from the shared input queue, process them and add them to a shared output queue
- One (or more) worker threads read processed blocks from the shared output queue and write them to the appropriate output files
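The three stages above can be wired together with thread-safe queues. Here is a sketch with one thread per stage, so the FIFO queues preserve block order on their own; the `process` function is a placeholder, and the queue bounds are illustrative:

```python
import queue
import threading

BLOCK_SIZE = 64 * 1024
DONE = object()  # sentinel marking end of stream

def process(block: bytes) -> bytes:
    return block.upper()  # placeholder for real per-block work

def read_stage(src_path, in_q):
    with open(src_path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            in_q.put(block)
    in_q.put(DONE)

def process_stage(in_q, out_q):
    while (block := in_q.get()) is not DONE:
        out_q.put(process(block))
    out_q.put(DONE)

def write_stage(dst_path, out_q):
    with open(dst_path, "wb") as f:
        while (block := out_q.get()) is not DONE:
            f.write(block)

def run_pipeline(src_path, dst_path):
    in_q = queue.Queue(maxsize=8)   # bounded: reader cannot outrun the CPU
    out_q = queue.Queue(maxsize=8)  # bounded: CPU cannot outrun the disk
    threads = [
        threading.Thread(target=read_stage, args=(src_path, in_q)),
        threading.Thread(target=process_stage, args=(in_q, out_q)),
        threading.Thread(target=write_stage, args=(dst_path, out_q)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The bounded queues give you back-pressure for free: a fast reader blocks on `put` instead of filling memory. With more than one thread per stage you would additionally need to count sentinels and tag blocks with ordering metadata, as discussed below.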
This is not an easy architecture to get right, and it requires quite a bit of thought to avoid lock contention on the shared queues, or overwhelming the system with concurrent I/O requests. You also need to carry control metadata with each block, so that the state of output processing lives in the input/output work queues rather than on the call stack of any one thread. And you have to make sure the output is transformed and written in the correct order, since with multiple threads reading and processing concurrently you cannot guarantee the order in which blocks land on the queues. It's complicated, but it is possible, and it can make a dramatic difference in throughput over the serial approach.
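One common way to handle that ordering problem: tag each block with a sequence number when it enters the pipeline, and have the writer hold out-of-order blocks in a min-heap until the next expected block arrives. A hypothetical sketch of such a reorder buffer:

```python
import heapq
import queue

def drain_in_order(out_q, emit, n_workers):
    """Consume (seq, block) pairs from out_q in arbitrary arrival order,
    calling emit(block) in strict sequence-number order. Each of the
    n_workers upstream threads pushes None when it finishes."""
    pending = []    # min-heap of (seq, block) pairs not yet emittable
    next_seq = 0    # next sequence number we are allowed to write
    finished = 0
    while finished < n_workers:
        item = out_q.get()
        if item is None:
            finished += 1
            continue
        heapq.heappush(pending, item)
        # Flush every block that is now contiguous with what we've written.
        while pending and pending[0][0] == next_seq:
            _, block = heapq.heappop(pending)
            emit(block)
            next_seq += 1

# Tiny demonstration: blocks arrive shuffled, come out ordered.
q = queue.Queue()
for seq in (2, 0, 3, 1):
    q.put((seq, f"block-{seq}".encode()))
q.put(None)  # our single simulated "worker" signals completion
result = []
drain_in_order(q, result.append, n_workers=1)
# result == [b"block-0", b"block-1", b"block-2", b"block-3"]
```

The heap stays small as long as the stages keep up with one another; if one block stalls, the heap also tells you exactly how much output is queued up behind it.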
If you really have the time and want to squeeze every ounce of performance from the system, you could also use I/O completion ports, a relatively low-level Windows API, to maximize throughput.
Good luck.