views: 305
answers: 4

RichCopy, a better-than-robocopy-with-GUI tool from Microsoft, seems to be the current tool of choice for copying files. One of its main features, highlighted in the TechNet article presenting the tool, is that it copies multiple files in parallel. In its default setting, three files are copied simultaneously, which you can see nicely in the GUI: [Progress: xx% of file A, yy% of file B, ...]. There are a lot of blog entries around praising this tool and claiming that this speeds up the copying process.

My question is: Why does this technique improve performance? As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network. My assumption would be that copying multiple files at once makes the whole process slower, since the HDD needs to jump back and forth between different files rather than just sequentially streaming one file. Since RichCopy is faster, there must be some mistake in my assumptions...

+5  A: 

The tool is making use of improvements in hardware which can optimise multiple read and write requests much better.

When copying one file at a time, the hardware isn't going to know that the block of data currently passing under the read head (or nearby) will be needed for a subsequent read, since the software hasn't queued that request yet.

A single file copy these days is not a very taxing task for modern disk sub-systems. By giving the hardware more work to do at once, the tool is leveraging its improved optimising features.
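
A rough sketch of the idea (Python, with placeholder file names): several worker threads each issue their own reads, so the OS I/O scheduler and the drive's command queue always have more than one outstanding request to reorder.

    from concurrent.futures import ThreadPoolExecutor

    def read_file(path, block_size=1024 * 1024):
        # Read the whole file in 1 MB chunks and discard the data.
        with open(path, "rb") as f:
            while f.read(block_size):
                pass

    # Placeholder names; each thread keeps a request outstanding against the disk.
    paths = ["file_a.bin", "file_b.bin", "file_c.bin"]
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        list(pool.map(read_file, paths))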

AnthonyWJones
+1  A: 

My guess is that the HDD read/write heads spend most of their time idle, waiting for the correct block of the disk to appear under them. The more data being copied at once, the less time spent idle, and most modern disk schedulers should take care of the jumping (for a low number of files/fragments).
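
Back-of-envelope numbers (assumed typical figures for a 7200 rpm drive, not measurements) show how large that idle time is compared with actually streaming data:

    # Assumed typical figures for a 7200 rpm desktop drive; illustrative only.
    rpm = 7200
    avg_rotational_latency_ms = (60 * 1000 / rpm) / 2  # ~4.2 ms for the sector to come around
    avg_seek_ms = 9.0                                   # assumed average seek time
    transfer_ms_per_mb = 1000 / 100                     # ~10 ms to stream 1 MB at 100 MB/s
    dead_time_ms = avg_seek_ms + avg_rotational_latency_ms
    print(round(dead_time_ms, 1), "ms repositioning vs", transfer_ms_per_mb, "ms per MB transferred")

With several requests queued, the scheduler can order them to cut down that repositioning time.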

josefx
+1  A: 

As far as I know, when copying files on modern computer systems, the HDD is the bottleneck, not the CPU or the network.

I think those assumptions are overly simplistic.

First, while LANs run at 100 Mbit / 1 Gbit, long-haul networks have a maximum data rate that is limited by the slowest link in the path.

Second, the effective throughput of a TCP/IP stream over the internet is often dominated by the time taken to round-trip messages and acknowledgments. For example, I have an 8+Mbit link, but my data rate on downloads is rarely above 1-2Mbit per second when I'm downloading from the USA. So if you can run multiple streams in parallel, one stream can be waiting for an acknowledgment while another is pumping packets. (But if you try to send too much, you start getting congestion, timeouts, back-off and lower overall transfer rates.)
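
A rough bandwidth-delay calculation (illustrative figures, not measurements of my link) shows why a single stream can stall well below the line rate, and why two or three streams in parallel can fill a link that one cannot:

    # A single TCP stream cannot exceed (receive window) / (round-trip time).
    window_bytes = 64 * 1024   # classic 64 KB window, assuming no window scaling
    rtt_s = 0.150              # assumed ~150 ms round trip to a distant server
    max_bytes_per_s = window_bytes / rtt_s
    print(round(max_bytes_per_s * 8 / 1_000_000, 1), "Mbit/s")  # ~3.5 Mbit/s, however fast the link is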

Finally, operating systems are good at doing a variety of I/O tasks in parallel with other work. If you are downloading 2 or more files in parallel, the O/S may be reading / processing network packets for one download and writing to disc for another one ... at the same time.
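
For example, a sketch of two downloads running in parallel (placeholder URLs; the point is only that one thread's network wait can overlap another thread's disc write):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    downloads = [
        ("http://example.com/a.iso", "a.iso"),  # placeholder URLs and file names
        ("http://example.com/b.iso", "b.iso"),
    ]

    with ThreadPoolExecutor(max_workers=len(downloads)) as pool:
        futures = [pool.submit(urllib.request.urlretrieve, url, dest)
                   for url, dest in downloads]
        for f in futures:
            f.result()  # wait for both, re-raising any error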

Stephen C
+2  A: 

A naive "copy multiple files" application will copy one file, then wait for that to complete before copying the next one.

This means that an individual file CANNOT be copied in less time than the network latency, even if it is empty (0 bytes). Because the copy probably involves several file server calls (open, write, close), this may be several times the latency.
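
Rough arithmetic (assumed figures) makes the cost visible:

    # Assumed figures: 1000 small files, 3 server round trips each (open, write, close),
    # 50 ms round-trip latency. At this point the file sizes barely matter.
    files, round_trips, rtt_s = 1000, 3, 0.050
    print(files * round_trips * rtt_s, "seconds spent just waiting")  # 150.0 seconds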

To copy files efficiently, you want a client and server which use a sane protocol with pipelining; that is to say, the client does NOT wait for the first file to be saved before sending the next, and indeed, several or many files may be "on the wire" at once.

Of course, doing that would require a custom server, not an SMB (or similar) file server. For example, rsync does this and is very good at copying large numbers of files despite being single threaded.

So my guess is that the multithreading helps because it is a work-around for the fact that the server doesn't support pipelining on a single session.
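
A sketch of that work-around (Python, with a hypothetical source directory and share path): with three copies in flight, as in RichCopy's default, each file's open/write/close round trips overlap with the other files'.

    import os
    import shutil
    from concurrent.futures import ThreadPoolExecutor

    src_dir = "local_files"           # hypothetical source directory
    dst_dir = r"\\server\share\dest"  # hypothetical SMB share

    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(shutil.copyfile,
                               os.path.join(src_dir, name),
                               os.path.join(dst_dir, name))
                   for name in os.listdir(src_dir)]
        for f in futures:
            f.result()  # surface any failed copy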

A single-threaded implementation which used a sensible protocol would be best in my opinion.

MarkR
Microsoft's file transfer protocols are very poorly 'designed'. Their implementations are worse still. My evidence for this is that Samba will outperform Windows on the same hardware. When copying in parallel, the delays caused by waiting for acknowledgments are mitigated by copying other files in the "dead time".
Tim Williscroft
My point was not that the protocol is badly designed; it's that its design doesn't lend itself to this particular use case. The protocol design is sufficient to implement the requirement to provide transparent remote file access; it just doesn't work too well for copying many small files over a link with latency - you need something else for that.
MarkR