views: 1799

answers: 7

Hi Techies,

I recently gave an interview at one of the TOP software companies. I was completely stuck on just one question the interviewer asked me, which was:

Q. I have a machine with 512 MB / 1 GB of RAM and I have to sort a file (XML, or any format) that is 4 GB in size. How will I proceed? What data structure will I use, which sorting algorithm, and how?

Do you think it is achievable? If yes, can you please explain?

Thanks in advance!

+2  A: 

There is a nice post on Guido van Rossum's blog which has something to suggest. Be aware that the code is in Python.

Fabrizio C.
Hi Fabrizio, thanks for your reply. The Guido van Rossum blog post is very close to my question, but it would still be great if you could give more references. Thanks!!
Jon K
I'm afraid I can't do much more than this. Anyway, I think you can get the rationale even from the other answers: they are all saying the same thing, IMHO.
Fabrizio C.
+3  A: 

Use Divide and Conquer.

Here's the pseudocode:

function sortFile(file)
    if fileTooBigForMemory(file)
       pair<firstHalfOfFile, secondHalfOfFile> = breakIntoTwoHalves(file)
       sortFile(firstHalfOfFile)
       sortFile(secondHalfOfFile)
       MergeTwoHalvesInOrder(firstHalfOfFile, secondHalfOfFile)
    else
       sortCharactersInFile(file)
    endif
end

Two well-known algorithms that fall into the divide-and-conquer category are merge sort and quicksort, so you could use either of them for the implementation.
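
For illustration, a minimal C# sketch of the merge step the pseudocode calls MergeTwoHalvesInOrder, assuming each half has already been sorted to disk as a text file with one record per line (the class, format, and file paths here are illustrative, not from the answer):

using System.IO;

static class TwoWayMerge
{
    // Merge two sorted text files (one record per line, illustrative format)
    // into one sorted output file, streaming so memory use stays constant.
    public static void Merge(string leftPath, string rightPath, string outPath)
    {
        using (StreamReader left = new StreamReader(leftPath))
        using (StreamReader right = new StreamReader(rightPath))
        using (StreamWriter output = new StreamWriter(outPath))
        {
            string l = left.ReadLine();
            string r = right.ReadLine();

            // Repeatedly emit the smaller of the two head lines.
            while (l != null && r != null)
            {
                if (string.CompareOrdinal(l, r) <= 0)
                {
                    output.WriteLine(l);
                    l = left.ReadLine();
                }
                else
                {
                    output.WriteLine(r);
                    r = right.ReadLine();
                }
            }

            // Drain whichever file still has lines left.
            for (; l != null; l = left.ReadLine()) output.WriteLine(l);
            for (; r != null; r = right.ReadLine()) output.WriteLine(r);
        }
    }
}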

As for the data structure, a char array containing the characters in the file could do. If you want to be more object-oriented, wrap it in a class called File:

class File {
    private char[] characters;
    //methods to access and mutate 'characters'
}
Frederick
This is also called multi-way merge in some textbooks.
Jason Short
+2  A: 

Split your file into chunks that fit into memory. Sort each chunk using quicksort and save it to a separate file. Then merge the result files and you get your result.
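
A rough C# sketch of the split-and-sort phase described here, assuming a line-oriented input file (the chunk size and temp-file names are illustrative; the merge phase is covered in the other answers):

using System;
using System.Collections.Generic;
using System.IO;

static class ChunkSorter
{
    // Read the big file in chunks small enough to sort in RAM,
    // sort each chunk, and write it to its own temporary file.
    // Returns the chunk file names for the later merge phase.
    public static List<string> SplitAndSort(string inputPath, int linesPerChunk)
    {
        List<string> chunkFiles = new List<string>();
        using (StreamReader reader = new StreamReader(inputPath))
        {
            List<string> lines = new List<string>(linesPerChunk);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
                if (lines.Count == linesPerChunk)
                {
                    chunkFiles.Add(WriteSortedChunk(lines, chunkFiles.Count));
                    lines.Clear();
                }
            }
            if (lines.Count > 0)
                chunkFiles.Add(WriteSortedChunk(lines, chunkFiles.Count));
        }
        return chunkFiles;
    }

    static string WriteSortedChunk(List<string> lines, int index)
    {
        lines.Sort(StringComparer.Ordinal);       // in-memory sort of one chunk
        string path = "chunk" + index + ".tmp";   // illustrative naming scheme
        File.WriteAllLines(path, lines.ToArray());
        return path;
    }
}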

Konstantin Spirin
+7  A: 

The answer the interviewer might be looking for is how you efficiently sort a data set that exceeds system memory. The following section is taken from Wikipedia:

Memory usage patterns and index sorting

When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.

For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.

One way to work around this problem, which works well when complex records (such as in a relational database) are being sorted by a relatively small key field, is to create an index into the array and then sort the index, rather than the entire array. (A sorted version of the entire array can then be produced with one pass, reading from the index, but often even that is unnecessary, as having the sorted index is adequate.) Because the index is much smaller than the entire array, it may fit easily in memory where the entire array would not, effectively eliminating the disk-swapping problem. This procedure is sometimes called "tag sort".[5]

Another technique for overcoming the memory-size problem is to combine two algorithms in a way that takes advantage of the strengths of each to improve overall performance. For instance, the array might be subdivided into chunks of a size that will fit easily in RAM (say, a few thousand elements), the chunks sorted using an efficient algorithm (such as quicksort or heapsort), and the results merged as per mergesort. This is less efficient than just doing mergesort in the first place, but it requires less physical RAM (to be practical) than a full quicksort on the whole array.

Techniques can also be combined. For sorting very large sets of data that vastly exceed system memory, even the index may need to be sorted using an algorithm or combination of algorithms designed to perform reasonably with virtual memory, i.e., to reduce the amount of swapping required.
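
To make the "tag sort" idea from the quote concrete, here is a small hedged C# example that sorts an index of keys instead of the records themselves; the records and keys are invented for the demonstration:

using System;

class TagSortDemo
{
    static void Main()
    {
        // Pretend each record is huge; only its sort key is held in memory.
        string[] keys = { "delta", "alpha", "charlie", "bravo" };

        // Build an index 0..n-1 and sort the index by key,
        // leaving the (hypothetically large) records untouched.
        int[] index = new int[keys.Length];
        for (int i = 0; i < index.Length; i++) index[i] = i;

        Array.Sort(index, (a, b) => string.CompareOrdinal(keys[a], keys[b]));

        // One sequential pass over the sorted index yields the records in order.
        foreach (int i in index)
            Console.WriteLine("record #" + i + " key=" + keys[i]);
    }
}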

m3rLinEz
A: 

Just simulate virtual memory: overload the array index operator, [].

Find a quicksort implementation that sorts an array in C++ or C#, and overload the indexer operator [] so that it reads from and writes to a file. That way you can plug in existing sort algorithms; you only change what happens behind the scenes on those [] accesses.

Michael Buen
Thanks for the response, Michael. Can we read from a file without opening it? And if we open it, won't its size go over the RAM size? Can you please clarify?
Jon K
Opening a file won't load its entire contents into memory if you use ReadLine. On the implementation side, ReadLine the XML/text file into a binary file. Then, using the array sort algorithm you like, do the read/sort/write on the binary file as facilitated by the overloaded [] operator. Then convert it back to XML.
Michael Buen
A: 

Here's one example of simulating virtual memory in C#.

Source: http://msdn.microsoft.com/en-us/library/aa288465(VS.71).aspx

// indexer.cs
// arguments: indexer.txt
using System;
using System.IO;

// Class to provide access to a large file
// as if it were a byte array.
public class FileByteArray
{
    Stream stream;  // Holds the underlying stream used to access the file.

    // Create a new FileByteArray encapsulating a particular file.
    public FileByteArray(string fileName)
    {
        stream = new FileStream(fileName, FileMode.Open);
    }

    // Close the stream. This should be the last thing done
    // when you are finished.
    public void Close()
    {
        stream.Close();
        stream = null;
    }

    // Indexer to provide read/write access to the file.
    public byte this[long index]   // long is a 64-bit integer
    {
        // Read one byte at offset index and return it.
        get
        {
            byte[] buffer = new byte[1];
            stream.Seek(index, SeekOrigin.Begin);
            stream.Read(buffer, 0, 1);
            return buffer[0];
        }
        // Write one byte at offset index.
        set
        {
            byte[] buffer = new byte[1] { value };
            stream.Seek(index, SeekOrigin.Begin);
            stream.Write(buffer, 0, 1);
        }
    }

    // Get the total length of the file.
    public long Length
    {
        get
        {
            return stream.Seek(0, SeekOrigin.End);
        }
    }
}

// Demonstrate the FileByteArray class.
// Reverses the bytes in a file.
public class Reverse
{
    public static void Main(String[] args)
    {
        // Check for arguments.
        if (args.Length == 0)
        {
            Console.WriteLine("indexer <filename>");
            return;
        }

        FileByteArray file = new FileByteArray(args[0]);
        long len = file.Length;

        // Swap bytes in the file to reverse it.
        for (long i = 0; i < len / 2; ++i)
        {
            byte t;

            // Note that indexing the "file" variable invokes the
            // indexer on the FileByteArray class, which reads
            // and writes the bytes in the file.
            t = file[i];
            file[i] = file[len - i - 1];
            file[len - i - 1] = t;
        }

        file.Close();
    }
}

Use the above code to roll your own array class, then just plug in any array sorting algorithm.
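
For example, here is a hedged sketch of what "plugging in" a sort might look like: a plain in-place quicksort written against the FileByteArray indexer above. The partitioning code is illustrative, and as the comments note, every [] access is a disk seek, so real use would need a caching layer:

// In-place quicksort that touches the data only through the
// FileByteArray indexer, so only a few bytes are in RAM at a time.
// Proof of concept: without buffering, each [] access is a disk seek.
public static class FileQuickSort
{
    // Sort file[lo..hi] in place. Call as: Sort(file, 0, file.Length - 1);
    public static void Sort(FileByteArray file, long lo, long hi)
    {
        if (lo >= hi) return;

        byte pivot = file[hi];           // last element as pivot (Lomuto scheme)
        long store = lo;
        for (long i = lo; i < hi; i++)
        {
            if (file[i] <= pivot)
            {
                byte tmp = file[i];      // swap through the indexer
                file[i] = file[store];
                file[store] = tmp;
                store++;
            }
        }
        byte t = file[hi];               // move the pivot into its final place
        file[hi] = file[store];
        file[store] = t;

        Sort(file, lo, store - 1);
        Sort(file, store + 1, hi);
    }
}

This sorts individual bytes; sorting real records would index fixed-size entries instead, but the indexer trick is the same.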

Michael Buen
Thanks Michael, I will try your solution, and I will mark your response as the correct answer. Cheers :)
Jon K
This code just reverses the file. It does not solve your problem and is extremely inefficient in terms of speed.
Konstantin Spirin
I know it just reverses; I'm just showing him a _proof of concept_ of how to make the file system "appear" as memory to a program, so he can *seamlessly plug in* any array sorting he likes. Undo your downvote, it's uncalled for. Perhaps this will help: http://letmegooglethatforyou.com/?q=c%23+array+quicksort
Michael Buen
Michael, I can't undo the downvote - SO says the vote is too old and can only be changed if the post changes. Anyway, the solution you are proposing is not reasonable, and any "TOP software company" would turn away a candidate who solved the problem that way.
Konstantin Spirin
It's reasonable if you employ some form of caching on it; any self-respecting programmer would optimize code they deemed slow. Case in point: I found that a VirtualMode DataGridView is best for huge data, but I won't use it naively; I'll add some caching for CellValueNeeded.
Michael Buen
+1  A: 

I would use a multiway merge. There is an excellent book called Managing Gigabytes that shows several different ways of doing it. They also cover sort-based inversion for files that are larger than physical memory. Look around page 240 for a pretty detailed algorithm for sorting through chunks on disk.

The post above is correct in that you split the file and sort each portion.

Say you have the 4 GB file and only want to load a maximum of 512 MB. That means you need to split the file into at least 8 chunks. If you are not sure how much extra overhead your sort is going to use, you might even double that number, to 16 chunks, to be safe.

The 16 files are then sorted one at a time, so each is in guaranteed order. So now you have chunks 0-15 as sorted files.

Now you open 16 file handles to those files and read one entry at a time, writing the lowest one to the final output. Since you know each of the files is already sorted, taking the lowest from each means you are then writing them in the correct order to the final output.
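
A hedged C# sketch of that merge loop, assuming the chunk files are sorted text files with one record per line (a linear scan picks the minimum for clarity; a heap of file handles would be the usual optimization):

using System.IO;

static class KWayMerge
{
    // Merge N sorted chunk files into a single sorted output file.
    public static void Merge(string[] chunkPaths, string outputPath)
    {
        StreamReader[] readers = new StreamReader[chunkPaths.Length];
        string[] heads = new string[chunkPaths.Length];  // current line of each chunk

        for (int i = 0; i < chunkPaths.Length; i++)
        {
            readers[i] = new StreamReader(chunkPaths[i]);
            heads[i] = readers[i].ReadLine();
        }

        using (StreamWriter output = new StreamWriter(outputPath))
        {
            while (true)
            {
                // Find the chunk whose current line is smallest.
                int min = -1;
                for (int i = 0; i < heads.Length; i++)
                    if (heads[i] != null &&
                        (min == -1 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                        min = i;

                if (min == -1) break;    // all chunks exhausted

                output.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();  // advance that chunk
            }
        }

        foreach (StreamReader r in readers) r.Close();
    }
}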

I have used such a system in C# for sorting large collections of spam words from emails. The original system required all of them to be loaded into RAM in order to sort them and build a dictionary of spam counts. Once the file grew over 2 GB, the in-memory structures required 6+ GB of RAM and took over 24 hours to sort due to paging and VM thrashing. The new system, using the chunking above, sorted the entire file in under 40 minutes. That was an impressive speedup for such a simple change.

I played with various load options (1/4 of system memory per chunk, etc.). It turned out that for our situation the best option was about 1/10 of system memory. Then Windows had enough memory left over for decent file I/O buffering to offset the increased file traffic, and the machine remained very responsive to other processes running on it.

And yes, I do frequently like to ask these types of questions in interviews as well, just to see if people can think outside the box. What do you do when you can't just use .Sort() on a list?

Jason Short