I have over 125 TSV files of ~100 MB each that I want to merge. The merge operation is allowed to destroy the 125 files, but not the data. What matters is that at the end, I end up with one big file containing the content of all the files, one after the other (in no specific order).

Is there an efficient way to do that? I was wondering if Windows provides an API to simply make a big "Union" of all those files? Otherwise, I will have to read all the files and write a big one.

Thanks!

+7  A: 

So "merging" is really just writing the files one after the other? That's pretty straightforward - just open one output stream, and then repeatedly open an input stream, copy the data, close. For example:

static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    // File.Create truncates an existing file; File.OpenWrite would leave old trailing bytes in place
    using (Stream output = File.Create(outputFile))
    {
        foreach (string inputFile in inputFiles)
        {
            using (Stream input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

That's using the Stream.CopyTo method which is new in .NET 4. If you're not using .NET 4, another helper method would come in handy:

private static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
}

There's nothing that I'm aware of that is more efficient than this... but importantly, this won't take up much memory on your system at all. It's not like it's repeatedly reading the whole file into memory then writing it all out again.

EDIT: As pointed out in the comments, there are ways you can fiddle with file options to potentially make it slightly more efficient in terms of what the file system does with the data. But fundamentally you're going to be reading the data and writing it, a buffer at a time, either way.
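As a rough sketch of the file-options idea, assuming the `FileStream` constructor overload that takes a buffer size and `FileOptions`: the name `ConcatenateFilesSequential` and the 64 KB buffer are illustrative choices, not measured recommendations, and whether the hint helps at all depends on the system.

```csharp
using System.IO;

static class FileConcat
{
    // Hypothetical variant of ConcatenateFiles: hints to the OS that each
    // input file will be read from start to end.
    public static void ConcatenateFilesSequential(string outputFile, params string[] inputFiles)
    {
        const int BufferSize = 64 * 1024; // assumed value; measure before tuning

        using (var output = new FileStream(outputFile, FileMode.Create, FileAccess.Write,
                                           FileShare.None, BufferSize))
        {
            foreach (string inputFile in inputFiles)
            {
                using (var input = new FileStream(inputFile, FileMode.Open, FileAccess.Read,
                                                  FileShare.Read, BufferSize,
                                                  FileOptions.SequentialScan))
                {
                    // Copies a buffer at a time, exactly like the plain version
                    input.CopyTo(output, BufferSize);
                }
            }
        }
    }
}
```

The data is still read and written a buffer at a time; only the hint to the file system changes.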

Jon Skeet
I guess your answer to the question is no?
Marcus Johansson
@Marcus: I guess so... although I wasn't sure that the OP would have been comfortable writing the stream versions above.
Jon Skeet
Thank you Jon for the help! :) I didn't know about "CopyTo".
Martin
Great indeed to hear about `CopyTo`, now I can delete my answer ;-)
Abel
The CopyStream method looks a lot like the implementation of CopyTo. Is that on purpose?
dada686
@dada686: It wasn't copied, but I'm not surprised if they're similar, given that they have exactly the same purpose and it's a pretty trivial bit of code.
Jon Skeet
Looking at the kernel level, it's likely that this isn't really the most efficient. You're spending quite a bit of time copying data in memory. Passing FILE_FLAG_NO_BUFFERING to the underlying CreateFile would prevent this.
MSalters
@MSalters: When you say "quite a bit of time" - isn't that likely to be massively dwarfed by the time spent doing the physical read? Using FileOptions.SequentialScan when creating the input streams may help, but I'd usually go for the simplest approach that worked until I found there to be an actual issue.
Jon Skeet
Actually, modern disks are becoming quite fast. This applies especially to RAID arrays and SSDs. Furthermore, it looks like you'd have not one but two memory copies (to and from the unaligned buffer). By skipping that, you're probably not going to see double-digit performance increases, but 1-10% faster is likely.
MSalters
A: 

Why do you want to do this?

One way might be to fiddle with low-level fragmentation; it would be cool if you got it to work.

Here is a wrapper for C#.

http://blogs.msdn.com/b/jeffrey_wall/archive/2004/09/13/229137.aspx

Marcus Johansson
+2  A: 

Do it from the command line:

copy 1.txt+2.txt+3.txt combined.txt

or

copy *.txt combined.txt
gmagana
You do realize he said **125** files, right? That's going to be very long and tedious to type out. If you gave a C# program to generate the copy string, that might be a *partial* answer.
Aaronaught
Dude, then use the second option, with the file mask. Or do a dir command (ie, dir /b to get only filenames), capture the filenames to a file, and construct the command in a good text editor. There are _many_ ways to avoid typing 125 filenames.
gmagana
The point is, you didn't even come close to answering the question. You've made a ton of assumptions about the problem domain that you can't possibly know. It's fine to *ask* for more details about the domain but not to simply assume that the question author has chosen an incorrect way of resolving his problem. -1 for your possibly irrelevant solution and your argumentative tone, "dude."
Aaronaught
LOL, gotta love self-appointed mods. Chill out. You read too much into things (which is, coincidentally, what you accuse me of; talk about projecting yourself). The OP asked how to combine files, I gave an answer that works. It may fit the problem perfectly or it may not. OP knows if that's the case, _but you do not_. I'm not up for a pissing match though, so this is my last response to you.
gmagana
+1  A: 

By "merge", do you mean that you want to decide with some custom logic which lines go where? Or do you mainly want to concatenate the files into one big one?

In the latter case, you may not need to do this programmatically at all: just generate one batch file containing a command like this (/b is for binary; remove it if not needed):

copy /b "file 1.tsv" + "file 2.tsv" "destination file.tsv"
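Generating that command could look something like the sketch below. `BuildCopyCommand` and `CopyCommandBuilder` are hypothetical names made up for illustration; the sketch only builds the string and assumes file names contain no double quotes.

```csharp
using System.IO;
using System.Linq;

static class CopyCommandBuilder
{
    // Hypothetical helper: builds a cmd.exe command of the form
    //   copy /b "in1" + "in2" + ... "destination"
    public static string BuildCopyCommand(string[] inputFiles, string destination)
    {
        // Quote every name so that paths containing spaces survive
        string sources = string.Join(" + ", inputFiles.Select(f => "\"" + f + "\""));
        return "copy /b " + sources + " \"" + destination + "\"";
    }
}
```

For example, `BuildCopyCommand(new[] { "file 1.tsv", "file 2.tsv" }, "destination file.tsv")` produces the command shown above, and the result could be written to a batch file with `File.WriteAllText`. With 125 long paths the line may exceed the cmd.exe command-line length limit, in which case the file mask or the stream approach is the safer route.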

Using C#, I'd take the following approach. Write a simple function that copies two streams:

void CopyStreamToStream(Stream dest, Stream src)
{
    int bytesRead;

    // experiment with the best buffer size; 65536 often performs well
    const int GOOD_BUFFER_SIZE = 65536;
    byte[] buffer = new byte[GOOD_BUFFER_SIZE];

    // copy everything
    while((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
    {
        dest.Write(buffer, 0, bytesRead);
    }
}

// then use as follows (do in a loop, don't forget to use using-blocks)
CopyStreamToStream(yourOutputStream, yourInputStream);
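A minimal sketch of such a loop, under assumptions: since the question allows the source files to be destroyed, this version appends to an existing destination and deletes each input once it has been copied. `AppendAndDelete` and `AppendMerge` are hypothetical names, and the destination must not also appear among the inputs.

```csharp
using System.IO;

static class AppendMerge
{
    // Hypothetical helper: appends every input to destinationFile and
    // deletes each input after its using-block has closed it.
    public static void AppendAndDelete(string destinationFile, params string[] inputFiles)
    {
        using (Stream dest = new FileStream(destinationFile, FileMode.Append, FileAccess.Write))
        {
            foreach (string inputFile in inputFiles)
            {
                using (Stream src = File.OpenRead(inputFile))
                {
                    CopyStreamToStream(dest, src);
                }
                dest.Flush();            // hand buffered bytes to the OS before deleting
                File.Delete(inputFile);  // allowed here: only the data must survive
            }
        }
    }

    // Same buffer-at-a-time copy as in the answer
    static void CopyStreamToStream(Stream dest, Stream src)
    {
        byte[] buffer = new byte[65536];
        int bytesRead;
        while ((bytesRead = src.Read(buffer, 0, buffer.Length)) > 0)
        {
            dest.Write(buffer, 0, bytesRead);
        }
    }
}
```

Passing one of the 125 files as the destination and the other 124 as inputs avoids copying that first file at all.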
Abel
@Aaronaught: I was halfway when I submitted, then I wrote the second part. But also, note the little hint in the second para: *"just generate one batch file"*. By generating, I mean: create automatically. But then I decided to add the C# code :)
Abel