Hi, what's the best way to implement an N-way merge for N sorted files?
Let's say I have 9 sorted files with 10 records each. How do I merge these files to create a big file with 90 sorted records?
The best strategy might depend on the amount of data.
Here's a code example that reads in N sorted text files and merges them. I have not included duplicate checking, but it should be easy to implement.
First, a helper class that wraps each file's StreamReader as an IEnumerator<string>:
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class MergeFile : IEnumerator<string>
{
    private readonly StreamReader _reader;

    public MergeFile(string file)
    {
        _reader = File.OpenText(file);
        // Prime Current with the first line so the merge loop can compare immediately.
        Current = _reader.ReadLine();
    }

    // The line this file is currently offering to the merge.
    public string Current { get; private set; }

    public void Dispose()
    {
        _reader.Close();
    }

    public bool MoveNext()
    {
        Current = _reader.ReadLine();
        return Current != null;
    }

    public void Reset()
    {
        throw new NotImplementedException();
    }

    object IEnumerator.Current
    {
        get { return Current; }
    }
}
And then the code to read and merge (it should be refactored for clarity in production):
// Get the file names and instantiate our helper class
List<IEnumerator<string>> files = Directory.GetFiles(@"C:\temp\files", "*.txt")
    .Select(file => new MergeFile(file))
    .Cast<IEnumerator<string>>()
    .ToList();

List<string> result = new List<string>();
IEnumerator<string> next = null;
while (true)
{
    bool done = true;
    // Loop over the helpers and pick the one offering the smallest current line.
    foreach (var mergeFile in files)
    {
        done = false;
        if (next == null || string.Compare(mergeFile.Current, next.Current) <= 0)
        {
            next = mergeFile;
        }
    }
    if (done) break;

    result.Add(next.Current);
    if (!next.MoveNext())
    {
        // This file is exhausted; dispose it and remove it from the list.
        next.Dispose();
        files.Remove(next);
        next = null;
    }
}
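To get the single output file from the question, you can then write the merged lines out, for example (the output path here is just a placeholder):
File.WriteAllLines(@"C:\temp\merged.txt", result);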
I'm assuming that there could be a lot more data than you gave in your example. If you can open all the files simultaneously you can use this algorithm: read the first record from each file into a priority queue ordered by the sort key, then repeatedly write out the smallest record in the queue and replace it with the next record from the file it came from, until every file is exhausted.
Note that you don't have to read all the files into memory at once, so this will work well if you have a reasonable number of large files, but not if you have a lot of small files.
If you have a lot of small files, you should merge them in groups to make a single output file for each group, then repeat the process on those output files.
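For the lots-of-small-files case, here's a rough sketch of that grouped approach. The mergeGroup delegate stands in for whatever single-pass merge you use (for example the helper-class code above wrapped in a method); MergeInGroups and the temp-file naming are just illustrative names:
// Sketch only; assumes files is non-empty and needs System, System.Collections.Generic, System.IO and System.Linq.
static string MergeInGroups(string[] files, int groupSize, string tempDir,
                            Action<string[], string> mergeGroup)
{
    int pass = 0;
    while (files.Length > 1)
    {
        var nextPass = new List<string>();
        for (int i = 0; i < files.Length; i += groupSize)
        {
            // Merge each group of up to groupSize files into one intermediate file.
            string output = Path.Combine(tempDir, $"pass{pass}_group{i / groupSize}.txt");
            mergeGroup(files.Skip(i).Take(groupSize).ToArray(), output);
            nextPass.Add(output);
        }
        files = nextPass.ToArray();
        pass++;
    }
    return files[0]; // the one remaining file contains all the records
}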
In C# you can use, for example, a SortedDictionary to implement the priority queue.
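A SortedDictionary keeps its entries sorted by key, so the first entry when you enumerate it is the minimum; here's a tiny illustration of the priority-queue usage (First() needs System.Linq):
var queue = new SortedDictionary<string, int>();
queue.Add("banana", 1);
queue.Add("apple", 2);
queue.Add("cherry", 3);
while (queue.Count > 0)                  // "is the queue empty?"
{
    var smallest = queue.First();        // entry with the smallest key
    Console.WriteLine(smallest.Key);     // prints apple, banana, cherry
    queue.Remove(smallest.Key);          // pop the minimum
}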
Addressing the comments in the other answer:
If you have a variable number of files, here's what I'd do. This is just a sketch to get the idea across; this code doesn't compile, I've gotten the method names wrong, and so on.
// initialize the data structures
var priorityQueue = new SortedDictionary<Record, Stream>();
var streams = new List<Stream>();
Stream outStream = null;
try
{
    // open the streams
    outStream = OpenOutputStream();
    foreach (var filename in filenames)
        streams.Add(GetFileStream(filename));

    // initialize the priority queue with the first record of each stream
    foreach (var stream in streams)
    {
        var record = ReadRecord(stream);
        if (record != null)
            priorityQueue.Add(record, stream);
    }

    // the main loop
    while (!priorityQueue.IsEmpty)
    {
        var record = priorityQueue.Smallest;
        var smallestStream = priorityQueue[record];
        WriteRecord(record, outStream);
        priorityQueue.Remove(record);
        var newRecord = ReadRecord(smallestStream);
        if (newRecord != null)
            priorityQueue.Add(newRecord, smallestStream);
    }
}
finally
{
    // clean up the streams
}
Does that make sense? You just keep on grabbing the smallest thing out of the priority queue and replacing it with the next record in that stream, if there is one. Eventually the queue will be empty and you'll be done.
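If you want a version of that idea which actually compiles, here's one possible take for newline-delimited text records, still using a SortedDictionary as the priority queue. The class and method names are mine, and keying the dictionary by a (line, file index) pair is a workaround for SortedDictionary rejecting duplicate keys when two files offer the same line at the same time:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class PriorityQueueMerge
{
    // Merge N sorted text files into one sorted output file using a
    // SortedDictionary as the priority queue. Keys are (line, file index)
    // pairs so that identical lines from different files don't collide.
    public static void Merge(IEnumerable<string> inputFiles, string outputFile)
    {
        var queue = new SortedDictionary<(string line, int file), StreamReader>();
        var readers = new List<StreamReader>();
        try
        {
            // Prime the queue with the first line of each input file.
            foreach (var path in inputFiles)
            {
                var reader = File.OpenText(path);
                readers.Add(reader);
                var line = reader.ReadLine();
                if (line != null)
                    queue.Add((line, readers.Count - 1), reader);
            }

            using (var output = new StreamWriter(outputFile))
            {
                while (queue.Count > 0)
                {
                    // Enumeration is in key order, so First() is the smallest line.
                    var smallest = queue.First();
                    output.WriteLine(smallest.Key.line);
                    queue.Remove(smallest.Key);

                    // Refill from the file that supplied the smallest line.
                    var next = smallest.Value.ReadLine();
                    if (next != null)
                        queue.Add((next, smallest.Key.file), smallest.Value);
                }
            }
        }
        finally
        {
            foreach (var reader in readers)
                reader.Dispose();
        }
    }
}
Calling it for the files in the question might look like PriorityQueueMerge.Merge(Directory.GetFiles(@"C:\temp\files", "*.txt"), @"C:\temp\merged.txt").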