
I have to split a huge file into many smaller files. Each destination file is defined by an offset and a length in bytes. I'm using the following code:

private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}

Considering that I have to call this function about 100,000 times, it is remarkably slow.

  1. Is there a way to connect the writer directly to the reader, i.e. without actually loading the contents into an in-memory buffer?
+5  A: 

How large is length? You may do better to re-use a fixed-size (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.

(edit) something like:

private static void copy(string srcFile, string dstFile, int offset,
     int length, byte[] buffer)
{
    using (Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        // While more than a full buffer remains, copy in buffer-sized blocks.
        while (length > bufferLength &&
            (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        // Then copy the remaining tail (less than one buffer's worth).
        while (length > 0 &&
            (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}
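
The caller would then allocate the buffer once and reuse it across all 100,000 calls. A sketch, where sections is a hypothetical list of (DstFile, Offset, Length) entries:

byte[] buffer = new byte[8192]; // allocated once, reused for every section
foreach (var s in sections)
{
    copy(srcFile, s.DstFile, s.Offset, s.Length, buffer);
}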
Marc Gravell
Any reason for the flush at the end? Closing it should do that. Also, I think you want to subtract from length in the first loop :)
Jon Skeet
Good eyes, Jon! The Flush was force of habit, from a lot of code where I pass streams in rather than open/close them in the method - it is convenient (if writing a non-trivial amount of data) to flush before returning.
Marc Gravell
+15  A: 

I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:

public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];

    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:

public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

Note that this also closes the output stream (due to the using statement), which your original code didn't.

The important point is that this will use the operating system's file buffering more efficiently, because you reuse the same input stream instead of reopening the file at the beginning and then seeking.

I think it'll be significantly faster, but obviously you'll need to try it to see...

This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
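
A sketch of what the calling side might look like under those assumptions: the input is opened (and buffered) exactly once, and "sections" is a hypothetical list of (TargetFile, Length) pairs in file order.

byte[] buffer = new byte[8192];
using (Stream input = new BufferedStream(File.OpenRead(srcFile)))
{
    foreach (var s in sections)
    {
        CopySection(input, s.TargetFile, s.Length, buffer);
    }
}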

Jon Skeet
+3  A: 

You shouldn't re-open the source file each time you do a copy; it's better to open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.

If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:

offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000

can be grouped to one read:

offset = 1234, length = 1074

Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.
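
A sketch of the idea, assuming a BinaryReader opened once and a hypothetical "group" holding the merged offset/length plus the individual sections (Path, Offset, Length) it covers:

// One big read covering several nearby sections...
reader.BaseStream.Seek(group.Offset, SeekOrigin.Begin);
byte[] block = reader.ReadBytes(group.Length);

// ...then each destination file is written from the in-memory block.
foreach (var s in group.Sections)
{
    // "Seek" within the block instead of the file.
    int start = (int)(s.Offset - group.Offset);
    using (var output = File.OpenWrite(s.Path))
    {
        output.Write(block, start, s.Length);
    }
}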

schnaader
+1  A: 

The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?

Over 100,000 accesses (sum the times): How much time is spent allocating the buffer array? How much time is spent opening the file for read (is it the same file every time?) How much time is spent in read and write operations?

If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a FileStream for writes? (Try it: do you get identical output? Does it save time?)
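
For example, a pair of Stopwatch instances around the existing read and write calls would separate the two costs (a sketch; Stopwatch accumulates elapsed time across Start/Stop pairs, so the totals cover all 100,000 calls):

static readonly Stopwatch readTime = new Stopwatch();  // System.Diagnostics
static readonly Stopwatch writeTime = new Stopwatch();

// Inside the copy routine:
readTime.Start();
byte[] buffer = reader.ReadBytes(length);
readTime.Stop();

writeTime.Start();
writer.Write(buffer);
writeTime.Stop();

// After the loop over all the sections:
Console.WriteLine("read: {0} ms, write: {1} ms",
    readTime.ElapsedMilliseconds, writeTime.ElapsedMilliseconds);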

JMarsch
A: 

(For future reference.)

Quite possibly the fastest way to do this would be to use memory-mapped files (so you're primarily copying memory, with the OS handling the file reads/writes via its paging/memory management).

Memory-mapped files are supported in managed code in .NET 4.0.
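
In .NET 4.0 that might look something like this (a sketch; "sections" is a hypothetical list of (Path, Offset, Length) entries):

using System.IO.MemoryMappedFiles;

// Map the source file once; each section is then copied by streaming
// a read-only view of it into the destination file.
using (var map = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open))
{
    foreach (var s in sections)
    {
        using (var view = map.CreateViewStream(s.Offset, s.Length,
                   MemoryMappedFileAccess.Read))
        using (var output = File.OpenWrite(s.Path))
        {
            view.CopyTo(output); // Stream.CopyTo is also new in 4.0
        }
    }
}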

But as noted, you need to profile, and expect to switch to native code for maximum performance.

Richard
A: 

No one has suggested threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files; this way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course, you should verify first that a sequential approach is not adequate.
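
A sketch of the idea, assuming the data has already been split into in-memory chunks (the Chunk type and chunks list are hypothetical):

class Chunk { public string Path; public byte[] Data; }

static void WriteChunks(List<Chunk> chunks)
{
    // Queue each write to the thread pool, then wait for all of them
    // to finish (CountdownEvent requires .NET 4.0).
    using (var done = new CountdownEvent(chunks.Count))
    {
        foreach (var chunk in chunks)
        {
            Chunk c = chunk; // per-iteration copy for the closure
            ThreadPool.QueueUserWorkItem(_ =>
            {
                File.WriteAllBytes(c.Path, c.Data);
                done.Signal();
            });
        }
        done.Wait(); // block until every file has been written
    }
}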

TheSean
Threading may help, but his bottleneck is surely on the I/O -- the CPU is probably spending a lot of time waiting on the disk. That's not to say that threading wouldn't make any difference (for example, if the writes are to different spindles, then he might get a better performance boost than he would if it were all on one disk)
JMarsch
+2  A: 

Have you considered using the CCR (Concurrency and Coordination Runtime)? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes it very easy to do this.

// file_path and split_size are assumed to be fields defined elsewhere.
static void Main(string[] args)
{
    Dispatcher dp = new Dispatcher();
    DispatcherQueue dq = new DispatcherQueue("DQ", dp);

    Port<long> offsetPort = new Port<long>();

    // Every offset posted to the port is handled by Split on a pool thread.
    Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
        new Handler<long>(Split)));

    FileStream fs = File.Open(file_path, FileMode.Open);
    long size = fs.Length;
    fs.Dispose();

    for (long i = 0; i < size; i += split_size)
    {
        offsetPort.Post(i);
    }
}

private static void Split(long offset)
{
    FileStream reader = new FileStream(file_path, FileMode.Open,
        FileAccess.Read);
    reader.Seek(offset, SeekOrigin.Begin);
    // Read a full split_size chunk, or whatever remains at the end of the file.
    long toRead = Math.Min(split_size, reader.Length - offset);

    byte[] buff = new byte[toRead];
    int read = 0;
    while (read < toRead) // Read may return fewer bytes than requested
        read += reader.Read(buff, read, (int)toRead - read);
    reader.Dispose();
    File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
}

This code posts offsets to a CCR port, which causes a thread to be created to execute the code in the Split method. This causes you to open the file multiple times, but it gets rid of the need for synchronization. You can make it more memory-efficient, but you'll have to sacrifice speed.

SpaceghostAli
A: 

Using FileStream + StreamWriter, I know it's possible to create massive files in little time (less than 1 minute 30 seconds). Using that technique, I generated three files totaling 700+ megabytes from one file.

The primary problem with the code you're using is that you open a file every time, which creates file I/O overhead.

If you knew the names of the files you were going to generate ahead of time, you could extract the File.OpenWrite call into a separate method; that would increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.

mcauthorn