views: 176
answers: 5

I have two files of 1 GB each, both containing only numbers in sorted order. I know how to read the contents of the files, sort them using the merge sort algorithm, and output the result to another file, but what I'm interested in is how to do this using only a 100 MB buffer (I'm not worried about the scratch space). For example, one way is to read 50 MB chunks from both files and sort them, and since the chunks are sorted I could read a new element and continue the process until I reach the end of both files. Can anyone give me an idea of how to implement this?

+6  A: 

Sounds like you only need to merge the numbers in your files, not sort them, since they're already sorted in each file. The merge part of merge sort is this:

function merge(left,right)
    var list result
    while length(left) > 0 or length(right) > 0
        if length(left) > 0 and length(right) > 0
            if first(left) ≤ first(right)
                append first(left) to result
                left = rest(left)
            else
                append first(right) to result
                right = rest(right)
        else if length(left) > 0
            append left to result
            break
        else if length(right) > 0
            append right to result
            break
    end while
    return result

Now you can just read the first 50 MB of numbers from both files in two buffers, apply the merge algorithm, then when one of the buffers has been exhausted (all its numbers analysed), read another 50 MB from the needed file. There's no need to sort anything.

You just need a condition that checks when one of your buffers is empty. When it is, read more from the file that buffer is associated with.
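
To make that concrete, here is a minimal C++ sketch of the idea, assuming whitespace-separated integers; the file names, the use of long long, and the ~50 MB chunk size are placeholders rather than anything from the question:

#include <fstream>
#include <vector>
#include <cstddef>

// Refill buf with up to max_count numbers from in; returns false if nothing was read.
static bool refill(std::ifstream& in, std::vector<long long>& buf,
                   std::size_t& pos, std::size_t max_count)
{
    buf.clear();
    pos = 0;
    long long x;
    while (buf.size() < max_count && in >> x)
        buf.push_back(x);
    return !buf.empty();
}

int main()
{
    const std::size_t kChunk = 50u * 1024 * 1024 / sizeof(long long); // ~50 MB worth of numbers
    std::ifstream a("a.txt"), b("b.txt");
    std::ofstream out("merged.txt");

    std::vector<long long> bufA, bufB;
    std::size_t ia = 0, ib = 0;
    bool hasA = refill(a, bufA, ia, kChunk);
    bool hasB = refill(b, bufB, ib, kChunk);

    while (hasA && hasB) {
        if (bufA[ia] <= bufB[ib])
            out << bufA[ia++] << '\n';
        else
            out << bufB[ib++] << '\n';
        if (ia == bufA.size()) hasA = refill(a, bufA, ia, kChunk); // buffer A exhausted: reload it
        if (ib == bufB.size()) hasB = refill(b, bufB, ib, kChunk); // buffer B exhausted: reload it
    }
    // One input is finished; drain whatever is left of the other.
    while (hasA) {
        out << bufA[ia++] << '\n';
        if (ia == bufA.size()) hasA = refill(a, bufA, ia, kChunk);
    }
    while (hasB) {
        out << bufB[ib++] << '\n';
        if (ib == bufB.size()) hasB = refill(b, bufB, ib, kChunk);
    }
}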

IVlad
The two files are sorted independently. For example, the first can have the numbers 1,3,5,...,1001 and the second can have 2,4,6,...,1000. So, in that case, doesn't it need to COMPARE and output the smallest number, which is sorting? I do understand your point, and I also wanted to know whether there is any way in C/C++ to continually/dynamically insert elements into the buffer as and when the buffer gets exhausted.
Sunil
@Sunil - no, it's not sorting. It's merging. Merging means building a sorted list from two independently sorted lists, which is exactly your problem. Here's how it would run for your example: compare 1 against 2: 1 is smaller, output 1 and move forward in the first list. Compare 3 against 2: 2 is smaller, output 2 and move forward in the second. Compare 3 against 4: 3 is smaller, output 3 and move forward in the first list. As for how this could be done in C++, consider an STL vector or even an STL queue: http://www.cplusplus.com/reference/stl/queue/
IVlad
+3  A: 

You probably want to read/write in reasonably large chunks to avoid I/O overhead, so use three buffers of ~30 MB each: input1, input2, and output.

Keep merging until either one of the input buffers is empty or the output buffer is full, then read to refill the empty input buffer or write out the full output buffer.

That way you are writing/reading large chunks of data from the disk.

Beyond that you need asynchronous I/O to read/write data while you are doing the sorting. But that's probably overkill.
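
A minimal sketch of the output half of the three-buffer scheme above: merged values accumulate in memory and are written in one large block whenever the buffer fills. The class name, the text formatting, and the capacity are illustrative assumptions, not anything from the answer:

#include <fstream>
#include <string>
#include <cstddef>

// Collects formatted output in memory and writes it out in large blocks.
class ChunkedWriter {
public:
    ChunkedWriter(const char* path, std::size_t capacity)
        : out_(path, std::ios::binary), capacity_(capacity) { buf_.reserve(capacity); }
    ~ChunkedWriter() { flush(); }
    void put(long long value) {
        buf_ += std::to_string(value);
        buf_ += '\n';
        if (buf_.size() >= capacity_) flush();   // output buffer full: write it out
    }
    void flush() {
        out_.write(buf_.data(), static_cast<std::streamsize>(buf_.size()));
        buf_.clear();
    }
private:
    std::ofstream out_;
    std::size_t capacity_;
    std::string buf_;
};

A merge loop would construct something like ChunkedWriter out("merged.txt", 30u * 1024 * 1024) and call out.put(x) for each value it emits.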

Douglas Leeder
The standard file stream objects are already buffered; there is no need to do any manual buffering.
Martin York
Could you guys explain in a little more detail how I read/write into the buffers while sorting at the same time?
Sunil
A: 

Since you're only doing a merge, not a complete sort, it's just the basic merge loop. Purely sequential I/O. No need to worry about buffers. Picture a zipper on a jacket. It's that simple. (Note: it could be a lot faster if the numbers are in binary format in the files. Not only will the files be smaller, but the program will be I/O limited, and the numbers will be perfectly accurate.)

#include <stdio.h>
#include <float.h>

#define BIGBIGNUMBER DBL_MAX  /* sentinel, assumed larger than anything in the files */

/* Return the next number from the file, or the sentinel once the file is exhausted. */
double GetNumberFromFile(FILE *file){
  double x;
  if (fscanf(file, "%lf", &x) != 1){
    return BIGBIGNUMBER;
  }
  return x;
}

/* AFILE, BFILE, and OUTFILE are FILE pointers assumed to be open already. */
double A = GetNumberFromFile(AFILE);
double B = GetNumberFromFile(BFILE);
while (A < BIGBIGNUMBER && B < BIGBIGNUMBER){
  if (A < B){
    fprintf(OUTFILE, "%g\n", A);
    A = GetNumberFromFile(AFILE);
  }
  else if (B < A){
    fprintf(OUTFILE, "%g\n", B);
    B = GetNumberFromFile(BFILE);
  }
  else {
    fprintf(OUTFILE, "%g\n", A);
    fprintf(OUTFILE, "%g\n", B); // or not, if you want to eliminate duplicates
    A = GetNumberFromFile(AFILE);
    B = GetNumberFromFile(BFILE);
  }
}
while (A < BIGBIGNUMBER){  /* file B is done; drain the rest of A */
  fprintf(OUTFILE, "%g\n", A);
  A = GetNumberFromFile(AFILE);
}
while (B < BIGBIGNUMBER){  /* file A is done; drain the rest of B */
  fprintf(OUTFILE, "%g\n", B);
  B = GetNumberFromFile(BFILE);
}

Responding to your question, consider a simpler problem: copying one file to another. You're only doing sequential I/O, which the file system is really good at. You write a simple loop to read small units, like a byte or an int, from one file and write them to the other. As soon as you try to read a byte, the system allocates a nice big buffer, swipes a big chunk of the file into the buffer, and then feeds you the byte out of the buffer. It keeps doing that until you need another buffer, when it invisibly gloms another one for you. The same sort of thing happens with the file you are writing.

Now the CPU is pretty quick, so it can iterate through the input bytes, copying them to the output, in a fraction of the time it takes to read or write a buffer, because the reading or writing can't go any faster than the external hardware. The only reason a larger buffer would help is that part of the reading/writing time is what's called "latency", basically the time it takes to move the head to the desired track and wait for the desired sector to come around. Most file systems break up the files into chunks that are sprinkled around the disk, so the head is jumping anyway. You can hear it.

The only difference between copying and a merge algorithm like yours is it's reading two files, not one. Either way, the basic time sequence is a series of buffer reads and writes interspersed with a small amount of CPU action. (It is possible to do overlapped I/O, so that the CPU action takes place while the I/O happens, so there is basically no delay between buffer reads and writes, but it was a bigger deal when CPUs were 1000 times slower.)

Of course, if you can arrange it so that the files being read and written are all on separate physical disk drives, and the drives are not fragmented much, then the amount of head motion could be minimized, and larger buffers might help. But basically, with a simple program, you can pretty much expect the simple code to go about as fast as the disk can move data, and giant buffers might help, but not much.
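
As a tiny sketch of the copy loop described above, relying entirely on the streams' built-in buffering (the file names are placeholders):

#include <fstream>

int main()
{
    std::ifstream in("input.bin", std::ios::binary);
    std::ofstream out("copy.bin", std::ios::binary);
    char c;
    // Byte-at-a-time loop; the underlying stream buffers turn this into
    // large block reads and writes against the disk.
    while (in.get(c))
        out.put(c);
}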

Mike Dunlavey
So you say there is no CPU involved? If comparison is a CPU task, then wouldn't this make the CPU wait for I/O? Generally, the CPU is much faster than I/O. In this program it looks like the CPU has to wait while the I/O fetches one number at a time, compare the numbers, and then wait again until the next two numbers appear. The better way, I think, would be to read a chunk from each file, say 100 MB. Isn't that the case? Please correct me if I'm wrong. Thanks
Sunil
Also, please look at my comment to dalle, which gives an idea of what I'm trying to do exactly.
Sunil
@Sunil: The answer is buffering. You don't read one number at a time from disk; it is buffered in several layers: the drive, the driver, the OS, the standard library, and/or the application. The buffers at each level don't need to be large in order to be sufficient.
dalle
@Sunil: In other words, you have a great idea, buffering. So great, in fact, that it's already being done for you :-)
Mike Dunlavey
@dalle: I see your point, but it looks like all the buffering is abstracted away from the user. Isn't there a way to control the buffer sizes allocated to the program?
Sunil
@Sunil: You can control the buffer size. I don't remember how. It's somewhere in the doc for std:: You can see for yourself if it makes a difference, and maybe post back here if it does.
Mike Dunlavey
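For reference, the two usual knobs are setvbuf for C stdio and pubsetbuf on the stream's filebuf for iostreams. A minimal sketch, with the caveat that whether the requested size is honored is implementation-defined, and the file names and 1 MB sizes are arbitrary placeholders:

#include <cstdio>
#include <fstream>
#include <vector>

int main()
{
    // C stdio: request a 1 MB fully buffered stream (setvbuf must be called
    // before any I/O is done on the stream).
    std::vector<char> cbuf(1 << 20);
    FILE* f = std::fopen("in1.txt", "r");
    if (f) std::setvbuf(f, cbuf.data(), _IOFBF, cbuf.size());

    // C++ iostreams: hand the filebuf a larger buffer before opening the file.
    std::vector<char> buf(1 << 20);
    std::ifstream in;
    in.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
    in.open("in2.txt");

    // ... read from f and in as usual ...
    if (f) std::fclose(f);
}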
@Sunil: Measure first, optimize later.
dalle
@all: Thanks for your posts. If I find anything new and useful, I'll post it back here.
Sunil
+3  A: 

Why not utilize the standard library?

#include <fstream>
#include <iterator>
#include <algorithm>

int main()
{
   std::ifstream in1("in1.txt");
   std::ifstream in2("in2.txt");
   std::ofstream ut("ut.txt");
   std::istream_iterator<int> in1_it(in1);
   std::istream_iterator<int> in2_it(in2);
   std::istream_iterator<int> in_end;
   std::ostream_iterator<int> ut_it(ut, "\n");
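   // std::merge pulls one value at a time from each istream_iterator; the
   // fstreams' own internal buffers do the block I/O, so memory use stays small.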

   std::merge(in1_it, in_end, in2_it, in_end, ut_it);
}
dalle
This is probably the easiest way, but the main idea is to use only 100 MB of memory. How do I efficiently merge two 1 GB files using only 100 MB of main memory/buffer space so that there is not a great deal of stalling of I/O and CPU? The two ways that I know of and was discussing are: use 50 MB for each file and, once either 50 MB buffer gets exhausted, read and refill it; the other way, and the hard part, is how to continually read the file and keep filling the buffer as I sort.
Sunil
@Sunil: This solution meets all your criteria.
Magnus Hoff
A: 

Benchmark it. Read value-by-value versus block reads. Feel the difference! =)

tiredcoder