Having read various questions on reading and writing Streams, I see that all the answers define something like this as the correct way to do it:

private void CopyStream(Stream input, Stream output)
{
   byte[] buffer = new byte[16 * 1024];   // 16 KB chunk buffer
   int read;
   // Read returns the number of bytes actually read; 0 means end of stream
   while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
   {
      output.Write(buffer, 0, read);      // write only the bytes actually read
   }
}
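
(As an aside: on .NET 4 and later, the framework provides essentially this pattern built in as Stream.CopyTo, which uses an internal default buffer and has an overload that takes an explicit buffer size:

   input.CopyTo(output);              // framework-chosen buffer size
   input.CopyTo(output, 16 * 1024);   // explicit buffer size

Internally it is essentially the same read/write loop as above.)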

Two questions:

Why read and write in these smaller chunks?

What is the significance of the buffer size used?

+3  A: 

You can generally always choose the size to read and write. However, some values will be more optimal for particular architectures. What these are is, well, beyond my knowledge. I've always tended to stick to figures I'm familiar with, such as 4K (the page size on the NT systems I used to write drivers for). But I've experimented in user mode with larger sizes and never come across any issues. I try to keep the number of IO calls as low as possible.

My suggestion is that these days the chunk size only really matters if it's very small (operation overhead vs. amount gained) or very large (blocking and saturating the IO system).

I think for any particular case you should:

  1. Minimise the number of IO calls.
  2. Alter this strategy if real performance is an issue (a rough measurement sketch follows below).
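
If real performance does become an issue, it is easy to measure rather than guess. Here is a rough, illustrative benchmark sketch (all names in it are made up for the example; note that the OS file cache will make repeated runs faster, so treat the numbers as indicative only):

using System;
using System.Diagnostics;
using System.IO;

class BufferSizeBenchmark
{
    static void CopyStream(Stream input, Stream output, int bufferSize)
    {
        byte[] buffer = new byte[bufferSize];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
            output.Write(buffer, 0, read);
    }

    static void Main()
    {
        // Create a throwaway 64 MB test file
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[64 * 1024 * 1024]);

        foreach (int size in new[] { 512, 4 * 1024, 64 * 1024, 1024 * 1024 })
        {
            using (var input = File.OpenRead(path))
            {
                var sw = Stopwatch.StartNew();
                CopyStream(input, Stream.Null, size);   // Stream.Null discards the writes
                sw.Stop();
                Console.WriteLine("{0,9}-byte buffer: {1} ms", size, sw.ElapsedMilliseconds);
            }
        }
        File.Delete(path);
    }
}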
Preet Sangha
+2  A: 

If you read a byte at a time, then every byte you read has the overhead of calling the function to read it, plus additional overheads (for example, doing a fileposition += 1 to remember where in the file you are, checking whether you have reached the end of the file, and so on).

If you read 4000 bytes, then you have the same overheads (in the above example, one function call, one add (fileposition += 4000), and one check to see if you are at the end of the file). So in terms of the overheads, you've just made it 4000 times faster. (In reality there are other costs, so you won't see that big a gain, but you have drastically cut the overheads.)
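
To make the contrast concrete, here is a sketch of both approaches using the standard Stream API (input and output are assumed to be open streams, as in the question):

// Byte-at-a-time: the function-call and bookkeeping overhead is paid per byte
int b;
while ((b = input.ReadByte()) != -1)
    output.WriteByte((byte)b);

// Chunked: the same overhead is paid once per 4000 bytes
byte[] buffer = new byte[4000];
int read;
while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    output.Write(buffer, 0, read);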

Of course, you could create a buffer as big as the entire file, and get the absolute minimum overheads. However:

  • the file might be huge - bigger than the memory available to your program, so this would simply fail. Or it might be so big that you start to use virtual memory, which will drastically slow things down. So breaking it into smaller chunks means you can copy an unlimited amount of data using a small fixed-size buffer.

  • your OS and devices might be able to read and write data simultaneously (e.g. copying from one physical disk drive to another). If you read all the data before you write all the data, then you have to wait for the whole read before you can start writing. But in many cases, you may be able to do both operations in parallel - read a small chunk and start writing it "asynchronously" (in the background) while you go back and read the next chunk (see the sketch after this list).

  • You get diminishing returns. Reading 4 bytes instead of 1 may well be 4x faster, but beyond a point, reading 4,000, 40,000 or 400,000 bytes at a time will not speed things up any further (indeed, for the reasons above, larger buffers could actually slow things down).

  • In some cases, physical devices work with specific data sizes (e.g. 4096 bytes per sector, 128 bytes per cache line, or 1500 bytes per data packet, or 8 bytes (64 bits) over a CPU bus). Dividing data up into chunks that match (or are multiples of) the underlying transport/storage mechanism can help the hardware to process the data more efficiently.
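
Here is the overlapped read/write idea from the second bullet as a minimal sketch, using double buffering with the Task-based Stream APIs (available from .NET 4.5 onwards; the method name is illustrative):

using System.IO;
using System.Threading.Tasks;

// Double-buffered copy: while one chunk is being written, the next is being read.
static async Task CopyStreamOverlappedAsync(Stream input, Stream output)
{
    byte[][] buffers = { new byte[16 * 1024], new byte[16 * 1024] };
    int current = 0;
    int read = await input.ReadAsync(buffers[current], 0, buffers[current].Length);
    while (read > 0)
    {
        Task writing = output.WriteAsync(buffers[current], 0, read); // write in the background
        current ^= 1;                                                // switch to the other buffer
        int nextRead = await input.ReadAsync(buffers[current], 0, buffers[current].Length);
        await writing;                                               // ensure the previous write finished
        read = nextRead;
    }
}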

Typically, I/O buffers of between 4kB and 128kB work best for most situations, and you can tune the size to the particular operation being performed, so there is no "perfect" size that fits all situations.

Note that in most I/O situations, there are many buffers being used. For example, when copying data from a disk, (in simplistic terms) it is read from the disk into a read cache (buffer) in the hard drive, then sent over the interface cable to the computer's drive controller, which may also buffer the data. Then it may be transferred into RAM via an I/O buffer, where it is held until your program is ready to receive it (the OS will probably even be fetching the data before you ask for it, as it expects you to continue reading from the same file, and buffers it ahead so you don't have to wait for it).

Then you read it into your buffer and write it. On the way out, it goes to another I/O buffer, is sent to the drive controller, passed on to the drive, and cached in a write cache. Eventually the hard drive will decide to actually commit the data in its write cache to the disk, and your copy will be completed.

Most of this happens in the background, so the data may not finish being written until many seconds after your program thinks it has finished writing and has moved on to another task. (This is why you have to "safely remove" USB drives before unplugging them - the OS may not have actually written all the data to the device yet, even many seconds after the computer said your copy operation was finished.)
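
If your program needs the data to actually reach the disk before it moves on (rather than sitting in one of those intermediate buffers), you have to ask for that explicitly. A minimal sketch, assuming the CopyStream method from the question and a hypothetical destination file name (the FileStream.Flush(bool) overload is available from .NET 4):

using (var output = new FileStream("copy.dat", FileMode.Create))   // "copy.dat" is just an example name
{
    CopyStream(input, output);
    output.Flush(true);   // true asks the OS to flush its own buffers to the device as well
}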

Jason Williams