I'm writing an external merge sort. It works like that: read k chunks from big file, sort them in memory, perform k-way merge, done. So I need to sequentially read from different portions of the file during the k-way merge phase. What's the best way to do that: several ifstreams or one ifstream and seeking? Also, is there a library for easy async IO?
Use one ifstream
at a time on the same file. More than one wastes resources, and you'd have to seek anyway (because by default the ifstream
's file pointer starts at the beginning of the file).
As for a C++ async IO library, check out this question.
EDIT: I originally misunderstood what you are trying to do (this Wikipedia article filled me in). I don't know how much ifstream
buffers by default, but you can turn off buffering by using the pubsetbuf(0, 0);
method described here, and then do your own buffering. This may be slower, however, than using multiple ifstream
s with automatic buffering. Some benchmarking is in order.
Definitely try the multiple streams. Seeking probably throws away internally buffered data (at least within the process, even if the OS retains it in cache), and if the items you're sorting are small that could be very costly indeed.
Anyway, it shouldn't be too hard to compare the performance of your two fstream strategies. Do a simple experiment with k = 2.
Note that there may be a limit on the number of simultaneous open files one process can have (ulimit -n
). if you reach that, then you might want to consider using a single stream, but buffering data from each of your k chunks manually.
It might be worth mmapping the file and using multiple pointers, if the file is small enough (equivalently: your address space is large enough).