My problem is with file-copying performance. We have a media management system that requires moving a lot of files around on the file system to different locations, including Windows shares on the same network, FTP sites, Amazon S3, etc. When we were all on one Windows network we could get away with using System.IO.File.Copy(source, destination) to copy a file. Since many times all we have is an input Stream (like a MemoryStream), we tried abstracting the Copy operation to take an input Stream and an output Stream, but we are seeing a massive performance decrease. Below is some code for copying a file to use as a discussion point.

public void Copy(System.IO.Stream inStream, string outputFilePath)
{
    int bufferSize = 1024 * 64;

    using (FileStream fileStream = new FileStream(outputFilePath, FileMode.OpenOrCreate, FileAccess.Write))
    {
        int bytesRead = -1;
        byte[] bytes = new byte[bufferSize];

        // Read a chunk, write it, and flush after every chunk until the input is exhausted.
        while ((bytesRead = inStream.Read(bytes, 0, bufferSize)) > 0)
        {
            fileStream.Write(bytes, 0, bytesRead);
            fileStream.Flush();
        }
    }
}

Does anyone know why this performs so much slower than File.Copy? Is there anything I can do to improve performance? Am I just going to have to put in special logic to detect when I'm copying from one Windows location to another--in which case I would just use File.Copy--and use the streams in all other cases?

Please let me know what you think and whether you need additional information. I have tried different buffer sizes, and it seems like a 64k buffer is optimal for our "small" files while 256k+ works better for our "large" files--but in either case it performs much worse than File.Copy(). Thanks in advance!

A: 

One thing that stands out is that you are reading a chunk, writing that chunk, reading another chunk and so on.

Streaming operations are great candidates for multithreading. My guess is that File.Copy implements multithreading.

Try reading in one thread and writing in another. You will need to coordinate the threads so that the write thread doesn't start writing a buffer until the read thread has finished filling it. You can solve this with two buffers--one being read into while the other is being written--and a flag that says which buffer is currently being used for which purpose.
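
For what it's worth, here is a minimal sketch of that two-buffer scheme (the class and method names are invented for illustration, and this is untested):

// One thread reads while the other writes; per-buffer events coordinate
// the "filled" and "drained" states of each buffer.
using System.IO;
using System.Threading;

public static class DoubleBufferedCopier
{
    public static void Copy(Stream inStream, Stream outStream, int bufferSize)
    {
        byte[][] buffers = { new byte[bufferSize], new byte[bufferSize] };
        int[] counts = new int[2]; // bytes read into each buffer
        AutoResetEvent[] filled = { new AutoResetEvent(false), new AutoResetEvent(false) };
        AutoResetEvent[] drained = { new AutoResetEvent(true), new AutoResetEvent(true) };

        Thread reader = new Thread(delegate()
        {
            int i = 0;
            while (true)
            {
                drained[i].WaitOne();                  // wait until buffer i is free
                counts[i] = inStream.Read(buffers[i], 0, bufferSize);
                filled[i].Set();                       // hand buffer i to the writer
                if (counts[i] == 0) break;             // end of input
                i = 1 - i;                             // alternate buffers
            }
        });
        reader.Start();

        int j = 0;
        while (true)
        {
            filled[j].WaitOne();                       // wait until buffer j has data
            if (counts[j] == 0) break;
            outStream.Write(buffers[j], 0, counts[j]);
            drained[j].Set();                          // return buffer j to the reader
            j = 1 - j;
        }
        reader.Join();
    }
}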

Eric J.
I'm currently investigating multithreading. Are there any good open source projects that do exactly this, perchance? I'll keep investigating. Thanks for the quick response.
jakejgordon
A: 

Try removing the Flush call from the loop, or moving it outside the loop entirely.

Sometimes the OS knows best when to flush the I/O; leaving it alone allows it to make better use of its internal buffers.
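
For instance, using the loop from the question (a minimal sketch):

while ((bytesRead = inStream.Read(bytes, 0, bufferSize)) > 0)
{
    fileStream.Write(bytes, 0, bytesRead);
}
fileStream.Flush(); // at most one flush, after the loop; the OS handles the rest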

Aviad Ben Dov
I also don't think that the Copy operation involves multithreading; personally, I would consider that a bad idea. It would mean creating a thread for every copy operation, which is supposedly even more costly than just using streams.
Aviad Ben Dov
@aviadbenov: It's true that creating our own threads to handle IO operations is overkill. However, .NET maintains a pool of threads expressly for this purpose. Using asynchronous I/O calls correctly allows us to utilise these threads without having to create and destroy them ourselves.
AnthonyWJones
@Anthony: What you're saying is true but also dangerous. If many threads were copying files, the thread pool itself would become the bottleneck of the copying operation!
Aviad Ben Dov
+1  A: 

Here's a similar answer:

http://stackoverflow.com/questions/230128/best-way-to-copy-between-two-stream-instances-c

Your main problem is the call to Flush(), which will bind your performance to the speed of the I/O.

sylvanaar
A: 

Three changes will dramatically improve performance:

  1. Increase your buffer size; try 1MB (well, just experiment).
  2. After you open your fileStream, call fileStream.SetLength(inStream.Length) to allocate the entire block on disk up front (this only works if inStream is seekable).
  3. Remove fileStream.Flush() - it is redundant and probably has the single biggest impact on performance, as it will block until the flush is complete. The stream will be flushed anyway on dispose.

This seemed about 3-4 times faster in the experiments I tried:

public static void Copy(System.IO.Stream inStream, string outputFilePath)
{
    int bufferSize = 1024 * 1024;

    using (FileStream fileStream = new FileStream(outputFilePath, FileMode.OpenOrCreate, FileAccess.Write))
    {
        fileStream.SetLength(inStream.Length);
        int bytesRead = -1;
        byte[] bytes = new byte[bufferSize];

        while ((bytesRead = inStream.Read(bytes, 0, bufferSize)) > 0)
        {
            fileStream.Write(bytes, 0, bytesRead);
        }
    }
}
Rob Levine
+4  A: 

File.Copy was built around the Win32 CopyFile function, and that function gets a lot of attention from the MS crew (remember the Vista-related threads about slow copy performance).

Several clues for improving the performance of your method:

  1. As many have said already, remove the Flush call from your cycle. You do not need it at all.
  2. Increasing the buffer may help, but only for file-to-file operations; for network shares or FTP servers it will slow things down instead. 60 * 1024 is ideal for network shares, at least before Vista; for FTP, 32k will be enough in most cases.
  3. Help the OS by telling it your caching strategy (in your case, sequential reading and writing): use the FileStream constructor override that takes a FileOptions parameter (SequentialScan). See the constructor sketch after this list.
  4. You can speed up copying by using the asynchronous pattern (especially useful for network-to-file cases), but do not use threads for this; instead use overlapped I/O (BeginRead, EndRead, BeginWrite, EndWrite in .NET), and do not forget to set the Asynchronous option in the FileStream constructor (see FileOptions).
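
For point 3, here is a minimal sketch of that constructor overload (sourceFilePath is just a placeholder):

// SequentialScan hints to the OS cache manager that the file will be read sequentially;
// Asynchronous enables the overlapped I/O used in the pattern below.
FileStream sourceStream = new FileStream(
    sourceFilePath,
    FileMode.Open,
    FileAccess.Read,
    FileShare.Read,
    64 * 1024,
    FileOptions.Asynchronous | FileOptions.SequentialScan);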

Example of the asynchronous copy pattern (sourceStream and destStream are the already-opened source and destination streams):

byte[] ActiveBuffer = new byte[1024 * 64];   // buffer being read into / written from
byte[] BackBuffer = new byte[1024 * 64];     // buffer for the overlapping read

int Readed = 0;
IAsyncResult ReadResult;
IAsyncResult WriteResult;

// Start the first read, then overlap each write with the read of the next chunk.
ReadResult = sourceStream.BeginRead(ActiveBuffer, 0, ActiveBuffer.Length, null, null);
do
{
    Readed = sourceStream.EndRead(ReadResult);

    // Begin writing the chunk that was just read...
    WriteResult = destStream.BeginWrite(ActiveBuffer, 0, Readed, null, null);

    if (Readed > 0)
    {
        // ...and begin reading the next chunk into the other buffer, then swap buffers.
        ReadResult = sourceStream.BeginRead(BackBuffer, 0, BackBuffer.Length, null, null);
        BackBuffer = Interlocked.Exchange(ref ActiveBuffer, BackBuffer);
    }

    destStream.EndWrite(WriteResult);
}
while (Readed > 0);
arbiter
A: 

Mark Russinovich would be the authority on this.

He wrote a blog entry, Inside Vista SP1 File Copy Improvements, which sums up the Windows state of the art through Vista SP1.

My semi-educated guess is that File.Copy would be the most robust over the greatest number of situations. Of course, that doesn't mean your own code can't beat it in some specific corner case...

lavinio
+2  A: 

Dusting off Reflector, we can see that File.Copy actually calls the Win32 API:

if (!Win32Native.CopyFile(fullPathInternal, dst, !overwrite))

Which resolves to

[DllImport("kernel32.dll", CharSet=CharSet.Auto, SetLastError=true)]
internal static extern bool CopyFile(string src, string dst, bool failIfExists);

And here is the documentation for CopyFile.

Ed Swangren
+1  A: 

You're never going to be able to beat the operating system at something so fundamental with your own code, not even if you craft it carefully in assembler.

If you need to make sure that your operations occur with the best performance AND you want to mix and match various sources, then you will need to create a type that describes the resource locations. You then create an API with functions such as Copy that take two such types and, having examined the descriptions of both, choose the best-performing copy mechanism. E.g., having determined that both locations are Windows file locations, it would choose File.Copy; if the source is a Windows file but the destination is an HTTP POST, it would use a WebRequest.
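
Something like this rough sketch, where all of the type and member names are invented for illustration:

public enum LocationKind { WindowsPath, FtpUrl, HttpUrl, S3 }

public class ResourceLocation
{
    public LocationKind Kind;
    public string Address; // a local/UNC path or a URL
}

public static class Transfers
{
    public static void Copy(ResourceLocation source, ResourceLocation dest)
    {
        if (source.Kind == LocationKind.WindowsPath && dest.Kind == LocationKind.WindowsPath)
        {
            // Best case: both ends are plain file paths, so let the OS do the work.
            System.IO.File.Copy(source.Address, dest.Address, true);
        }
        else
        {
            // Mixed case: fall back to a generic stream-to-stream copy.
            using (System.IO.Stream inStream = OpenRead(source))
            using (System.IO.Stream outStream = OpenWrite(dest))
            {
                byte[] buffer = new byte[64 * 1024];
                int read;
                while ((read = inStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    outStream.Write(buffer, 0, read);
                }
            }
        }
    }

    // These would wrap FileStream, FtpWebRequest, an S3 client, etc. per LocationKind.
    private static System.IO.Stream OpenRead(ResourceLocation location)
    {
        throw new System.NotImplementedException();
    }

    private static System.IO.Stream OpenWrite(ResourceLocation location)
    {
        throw new System.NotImplementedException();
    }
}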

AnthonyWJones