views:

94

answers:

2

I'm writing an external merge sort. It works like that: read k chunks from big file, sort them in memory, perform k-way merge, done. So I need to sequentially read from different portions of the file during the k-way merge phase. What's the best way to do that: several ifstreams or one ifstream and seeking? Also, is there a library for easy async IO?

+2  A: 

Use one ifstream at a time on the same file. More than one wastes resources, and you'd have to seek anyway (because by default the ifstream's file pointer starts at the beginning of the file).

As for a C++ async IO library, check out this question.

EDIT: I originally misunderstood what you are trying to do (this Wikipedia article filled me in). I don't know how much ifstream buffers by default, but you can turn off buffering by using the pubsetbuf(0, 0); method described here, and then do your own buffering. This may be slower, however, than using multiple ifstreams with automatic buffering. Some benchmarking is in order.

Cameron
You wouldn't "have to seek anyway", because with multiple streams you do one seek per chunk of the file (k). Using a single stream to do a multi-way merge, you do roughly one seek per item merged (n), although you might get lucky if a long run of items is coming from a single chunk. The chance of that is negligible if the data is initially in random order, though.
Steve Jessop
+1  A: 

Definitely try the multiple streams. Seeking probably throws away internally buffered data (at least within the process, even if the OS retains it in cache), and if the items you're sorting are small that could be very costly indeed.

Anyway, it shouldn't be too hard to compare the performance of your two fstream strategies. Do a simple experiment with k = 2.

Note that there may be a limit on the number of simultaneous open files one process can have (ulimit -n). if you reach that, then you might want to consider using a single stream, but buffering data from each of your k chunks manually.

It might be worth mmapping the file and using multiple pointers, if the file is small enough (equivalently: your address space is large enough).

Steve Jessop
Is there a limit on the number of open files or open filestreams (I need to access only 1 file, but with many filestreams)?Unfortunately file size is far bigger than memory size. That's exactly the case with external sorting.
SpyBot
Probably streams, but I'm not absolutely certain. When *NIX says "open files", I believe it means the number of file descriptors rather than the number of distinct files to which they point. I mentioned mmap because it doesn't require the file to fit in memory, just that an address range can be given to it (within your 32 or 64 bit virtual address space). The OS then does the job of paging sections of the file in and out of physical RAM as they're used.
Steve Jessop