Hi, what's the best way to implement an N-way merge for N sorted files?
Let's say I have 9 sorted files with 10 records each. How do I merge these files to create a big file with 90 sorted records?
The best strategy might depend on the amount of data.
Here's a code example that reads in N sorted text files and merges them. I have not included duplicate checking, but it should be easy to implement.
First, a helper class that wraps each file's StreamReader as an IEnumerator<string>:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class MergeFile : IEnumerator<string>
{
    private readonly StreamReader _reader;

    public MergeFile(string file)
    {
        _reader = File.OpenText(file);
        // Prime Current with the first line so the merge loop can compare immediately.
        Current = _reader.ReadLine();
    }

    // The line this file is currently offering to the merge.
    public string Current { get; private set; }

    public void Dispose()
    {
        _reader.Close();
    }

    public bool MoveNext()
    {
        Current = _reader.ReadLine();
        return Current != null;
    }

    public void Reset()
    {
        throw new NotImplementedException();
    }

    object IEnumerator.Current
    {
        get { return Current; }
    }
}
And then the code to read and merge (it should be refactored for clarity in production):
// Get the file names and instantiate our helper class
List<IEnumerator<string>> files = Directory.GetFiles(@"C:\temp\files", "*.txt")
    .Select(file => new MergeFile(file))
    .Cast<IEnumerator<string>>()
    .ToList();

List<string> result = new List<string>();
IEnumerator<string> next = null;
while (true)
{
    bool done = true;
    // Loop over the helpers and pick the one offering the smallest current line.
    foreach (var mergeFile in files)
    {
        done = false;
        if (next == null || string.Compare(mergeFile.Current, next.Current) <= 0)
        {
            next = mergeFile;
        }
    }
    if (done) break;

    result.Add(next.Current);
    if (!next.MoveNext())
    {
        // This file is exhausted; dispose it and remove it from the list.
        next.Dispose();
        files.Remove(next);
        next = null;
    }
}
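To get the single output file from the question, you can then write the merged lines out, for example (the output path here is just a placeholder):
File.WriteAllLines(@"C:\temp\merged.txt", result);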
I'm assuming that there could be a lot more data than you gave in your example. If you can open all the files simultaneously you can use this algorithm: read the first record from each file into a priority queue ordered by the sort key, then repeatedly write out the smallest record in the queue and replace it with the next record from the file it came from, until every file is exhausted.
Note that you don't have to read all the files into memory at once, so this will work well if you have a reasonable number of large files, but not if you have a lot of small files.
If you have a lot of small files, you should merge them in groups to make a single output file for each group, then repeat the process on those output files.
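For the lots-of-small-files case, here's a rough sketch of that grouped approach. The mergeGroup delegate stands in for whatever single-pass merge you use (for example the helper-class code above wrapped in a method); MergeInGroups and the temp-file naming are just illustrative names:
// Sketch only; assumes files is non-empty and needs System, System.Collections.Generic, System.IO and System.Linq.
static string MergeInGroups(string[] files, int groupSize, string tempDir,
                            Action<string[], string> mergeGroup)
{
    int pass = 0;
    while (files.Length > 1)
    {
        var nextPass = new List<string>();
        for (int i = 0; i < files.Length; i += groupSize)
        {
            // Merge each group of up to groupSize files into one intermediate file.
            string output = Path.Combine(tempDir, $"pass{pass}_group{i / groupSize}.txt");
            mergeGroup(files.Skip(i).Take(groupSize).ToArray(), output);
            nextPass.Add(output);
        }
        files = nextPass.ToArray();
        pass++;
    }
    return files[0]; // the one remaining file contains all the records
}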
In C# you can use, for example, a SortedDictionary to implement the priority queue.
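A SortedDictionary keeps its entries sorted by key, so the first entry when you enumerate it is the minimum; here's a tiny illustration of the priority-queue usage (First() needs System.Linq):
var queue = new SortedDictionary<string, int>();
queue.Add("banana", 1);
queue.Add("apple", 2);
queue.Add("cherry", 3);
while (queue.Count > 0)                  // "is the queue empty?"
{
    var smallest = queue.First();        // entry with the smallest key
    Console.WriteLine(smallest.Key);     // prints apple, banana, cherry
    queue.Remove(smallest.Key);          // pop the minimum
}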
Addressing the comments in the other answer:
If you have a variable number of files, here's what I'd do. This is just a sketch to get the idea across; this code doesn't compile, I've gotten the method names wrong, and so on.
// initialize the data structures
var priorityQueue = new SortedDictionary<Record, Stream>();
var streams = new List<Stream>();
Stream outStream = null;
try
{
    // open the streams
    outStream = OpenOutputStream();
    foreach (var filename in filenames)
        streams.Add(GetFileStream(filename));

    // initialize the priority queue with the first record of each stream
    foreach (var stream in streams)
    {
        var record = ReadRecord(stream);
        if (record != null)
            priorityQueue.Add(record, stream);
    }

    // the main loop
    while (!priorityQueue.IsEmpty)
    {
        var record = priorityQueue.Smallest;
        var smallestStream = priorityQueue[record];
        WriteRecord(record, outStream);
        priorityQueue.Remove(record);
        var newRecord = ReadRecord(smallestStream);
        if (newRecord != null)
            priorityQueue.Add(newRecord, smallestStream);
    }
}
finally
{
    // clean up the streams
}
Does that make sense? You just keep on grabbing the smallest thing out of the priority queue and replacing it with the next record in that stream, if there is one. Eventually the queue will be empty and you'll be done.
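If you want a version of that idea which actually compiles, here's one possible take for newline-delimited text records, still using a SortedDictionary as the priority queue. The class and method names are mine, and keying the dictionary by a (line, file index) pair is a workaround for SortedDictionary rejecting duplicate keys when two files offer the same line at the same time:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class PriorityQueueMerge
{
    // Merge N sorted text files into one sorted output file using a
    // SortedDictionary as the priority queue. Keys are (line, file index)
    // pairs so that identical lines from different files don't collide.
    public static void Merge(IEnumerable<string> inputFiles, string outputFile)
    {
        var queue = new SortedDictionary<(string line, int file), StreamReader>();
        var readers = new List<StreamReader>();
        try
        {
            // Prime the queue with the first line of each input file.
            foreach (var path in inputFiles)
            {
                var reader = File.OpenText(path);
                readers.Add(reader);
                var line = reader.ReadLine();
                if (line != null)
                    queue.Add((line, readers.Count - 1), reader);
            }

            using (var output = new StreamWriter(outputFile))
            {
                while (queue.Count > 0)
                {
                    // Enumeration is in key order, so First() is the smallest line.
                    var smallest = queue.First();
                    output.WriteLine(smallest.Key.line);
                    queue.Remove(smallest.Key);

                    // Refill from the file that supplied the smallest line.
                    var next = smallest.Value.ReadLine();
                    if (next != null)
                        queue.Add((next, smallest.Key.file), smallest.Value);
                }
            }
        }
        finally
        {
            foreach (var reader in readers)
                reader.Dispose();
        }
    }
}
Calling it for the files in the question might look like PriorityQueueMerge.Merge(Directory.GetFiles(@"C:\temp\files", "*.txt"), @"C:\temp\merged.txt").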