
I have to split a huge file into many smaller files. Each destination file is defined by an offset and a length in bytes. I'm using the following code:

private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}

Considering that I have to call this function about 100,000 times, it is remarkably slow.

  1. Is there a way to connect the writer directly to the reader, i.e. without actually loading the contents into an in-memory buffer?
+5  A: 

How large is length? You may do better to re-use a fixed-size (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.

(edit) something like:

private static void copy(string srcFile, string dstFile, int offset,
     int length, byte[] buffer)
{
    using (Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        // While more than a full buffer remains, copy in buffer-sized blocks.
        while (length > bufferLength &&
            (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        // Then copy the remaining tail (less than one buffer's worth).
        while (length > 0 &&
            (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
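
The caller would then allocate the buffer once and reuse it across all 100,000 calls. A sketch, where sections is a hypothetical list of (DstFile, Offset, Length) entries:

byte[] buffer = new byte[8192]; // allocated once, reused for every section
foreach (var s in sections)
{
    copy(srcFile, s.DstFile, s.Offset, s.Length, buffer);
}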
Marc Gravell
Any reason for the flush at the end? Closing it should do that. Also, I think you want to subtract from length in the first loop :)
Jon Skeet
Good eyes, Jon! The Flush was force of habit, from a lot of code where I pass streams in rather than open/close them in the method - it is convenient (if writing a non-trivial amount of data) to flush before returning.
Marc Gravell
+15  A: 

I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:

public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];

    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:

public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

Note that this also closes the output stream (due to the using statement), which your original code didn't.

The important point is that this will use the operating system's file buffering more efficiently, because you reuse the same input stream instead of reopening the file at the beginning and then seeking.

I think it'll be significantly faster, but obviously you'll need to try it to see...

This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
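
A sketch of what the calling side might look like under those assumptions: the input is opened (and buffered) exactly once, and "sections" is a hypothetical list of (TargetFile, Length) pairs in file order.

byte[] buffer = new byte[8192];
using (Stream input = new BufferedStream(File.OpenRead(srcFile)))
{
    foreach (var s in sections)
    {
        CopySection(input, s.TargetFile, s.Length, buffer);
    }
}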

Jon Skeet
+3  A: 

You shouldn't re-open the source file each time you do a copy; it's better to open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.

If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:

offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000

can be grouped to one read:

offset = 1234, length = 1074

Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.
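
A sketch of the idea, assuming a BinaryReader opened once and a hypothetical "group" holding the merged offset/length plus the individual sections (Path, Offset, Length) it covers:

// One big read covering several nearby sections...
reader.BaseStream.Seek(group.Offset, SeekOrigin.Begin);
byte[] block = reader.ReadBytes(group.Length);

// ...then each destination file is written from the in-memory block.
foreach (var s in group.Sections)
{
    // "Seek" within the block instead of the file.
    int start = (int)(s.Offset - group.Offset);
    using (var output = File.OpenWrite(s.Path))
    {
        output.Write(block, start, s.Length);
    }
}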

schnaader
+1  A: 

The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?

Over 100,000 accesses (sum the times): How much time is spent allocating the buffer array? How much time is spent opening the file for read (is it the same file every time?) How much time is spent in read and write operations?

If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a FileStream for writes? (Try it: do you get identical output? Does it save time?)
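
For example, a pair of Stopwatch instances around the existing read and write calls would separate the two costs (a sketch; Stopwatch accumulates elapsed time across Start/Stop pairs, so the totals cover all 100,000 calls):

static readonly Stopwatch readTime = new Stopwatch();  // System.Diagnostics
static readonly Stopwatch writeTime = new Stopwatch();

// Inside the copy routine:
readTime.Start();
byte[] buffer = reader.ReadBytes(length);
readTime.Stop();

writeTime.Start();
writer.Write(buffer);
writeTime.Stop();

// After the loop over all the sections:
Console.WriteLine("read: {0} ms, write: {1} ms",
    readTime.ElapsedMilliseconds, writeTime.ElapsedMilliseconds);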

JMarsch
A: 

(For future reference.)

Quite possibly the fastest way to do this would be to use memory-mapped files (so you're primarily copying memory, with the OS handling the file reads/writes via its paging/memory management).

Memory-mapped files are supported in managed code in .NET 4.0.
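
In .NET 4.0 that might look something like this (a sketch; "sections" is a hypothetical list of (Path, Offset, Length) entries):

using System.IO.MemoryMappedFiles;

// Map the source file once; each section is then copied by streaming
// a read-only view of it into the destination file.
using (var map = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open))
{
    foreach (var s in sections)
    {
        using (var view = map.CreateViewStream(s.Offset, s.Length,
                   MemoryMappedFileAccess.Read))
        using (var output = File.OpenWrite(s.Path))
        {
            view.CopyTo(output); // Stream.CopyTo is also new in 4.0
        }
    }
}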

But as noted, you need to profile, and expect to switch to native code for maximum performance.

Richard
A: 

No one has suggested threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files; this way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course, you should verify first that a sequential approach is not adequate.
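
A sketch of the idea, assuming the data has already been split into in-memory chunks (the Chunk type and chunks list are hypothetical):

class Chunk { public string Path; public byte[] Data; }

static void WriteChunks(List<Chunk> chunks)
{
    // Queue each write to the thread pool, then wait for all of them
    // to finish (CountdownEvent requires .NET 4.0).
    using (var done = new CountdownEvent(chunks.Count))
    {
        foreach (var chunk in chunks)
        {
            Chunk c = chunk; // per-iteration copy for the closure
            ThreadPool.QueueUserWorkItem(_ =>
            {
                File.WriteAllBytes(c.Path, c.Data);
                done.Signal();
            });
        }
        done.Wait(); // block until every file has been written
    }
}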

TheSean
Threading may help, but his bottleneck is surely on the I/O -- the CPU is probably spending a lot of time waiting on the disk. That's not to say that threading wouldn't make any difference (for example, if the writes are to different spindles, then he might get a better performance boost than he would if it were all on one disk)
JMarsch
+2  A: 

Have you considered using the CCR (Concurrency and Coordination Runtime)? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes it very easy to do this.

// file_path and split_size are assumed to be fields defined elsewhere.
static void Main(string[] args)
{
    Dispatcher dp = new Dispatcher();
    DispatcherQueue dq = new DispatcherQueue("DQ", dp);

    Port<long> offsetPort = new Port<long>();

    // Every offset posted to the port is handled by Split on a pool thread.
    Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
        new Handler<long>(Split)));

    FileStream fs = File.Open(file_path, FileMode.Open);
    long size = fs.Length;
    fs.Dispose();

    for (long i = 0; i < size; i += split_size)
    {
        offsetPort.Post(i);
    }
}

private static void Split(long offset)
{
    FileStream reader = new FileStream(file_path, FileMode.Open,
        FileAccess.Read);
    reader.Seek(offset, SeekOrigin.Begin);
    // Read a full split_size chunk, or whatever remains at the end of the file.
    long toRead = Math.Min(split_size, reader.Length - offset);

    byte[] buff = new byte[toRead];
    int read = 0;
    while (read < toRead) // Read may return fewer bytes than requested
        read += reader.Read(buff, read, (int)toRead - read);
    reader.Dispose();
    File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
}

This code posts offsets to a CCR port, which causes a thread to be created to execute the code in the Split method. This causes you to open the file multiple times, but it gets rid of the need for synchronization. You can make it more memory-efficient, but you'll have to sacrifice speed.

SpaceghostAli
A: 

Using FileStream + StreamWriter, I know it's possible to create massive files in little time (less than 1 minute 30 seconds). Using that technique, I generated three files totaling 700+ megabytes from one file.

The primary problem with the code you're using is that you open a file every time, which creates file I/O overhead.

If you knew the names of the files you were going to generate ahead of time, you could extract the File.OpenWrite call into a separate method; that would increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.

mcauthorn