
Hi, I've been trying to find out the fastest way to code a file copy routine to copy a large file onto RAID 5 hardware.

The average file size is around 2 GB.

There are 2 Windows boxes (both running Win2K3). The first box is the source, where the large file is located, and the second box has the RAID 5 storage.

http://blogs.technet.com/askperf/archive/2007/05/08/slow-large-file-copy-issues.aspx

The above link clearly explains why Windows copy, Robocopy and other common copy utilities suffer in write performance. Hence, I've written a C/C++ program that uses the CreateFile, ReadFile & WriteFile APIs with the NO_BUFFERING & WRITE_THROUGH flags. The program simulates ESEUTIL.exe in the sense that it uses 2 threads, one for reading and one for writing. The reader thread reads 256 KB from the source and fills a buffer. Once 32 such 256 KB chunks are filled, the writer thread writes the contents of the buffer to the destination file. As you can see, the writer thread writes 8 MB of data in one shot. The program allocates 16 such 8 MB blocks, hence the writing and reading can happen in parallel. Details of ESEUTIL.exe can be found in the above link. Note: I am taking care of the data alignment issues when using NO_BUFFERING.
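For illustration, here is a minimal sketch (not the actual program) of how the unbuffered handles and an aligned buffer can be set up; the paths, sizes and omitted thread code are placeholders:

#include <windows.h>

// Hypothetical sketch of the unbuffered setup described above; paths and sizes
// are illustrative and error handling is minimal.
int main()
{
    const DWORD kChunk = 256 * 1024;        // 256 KB read chunk
    const DWORD kBlock = 8 * 1024 * 1024;   // 8 MB write block

    // FILE_FLAG_NO_BUFFERING requires the buffer address, file offset and
    // transfer size to be multiples of the volume sector size. VirtualAlloc
    // returns page-aligned memory, which satisfies the address requirement.
    void* block = VirtualAlloc(NULL, kBlock, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    HANDLE hSrc = CreateFileA("\\\\source\\share\\big.file", GENERIC_READ,
                              FILE_SHARE_READ, NULL, OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING, NULL);
    HANDLE hDst = CreateFileA("D:\\target\\big.file", GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
    if (!block || hSrc == INVALID_HANDLE_VALUE || hDst == INVALID_HANDLE_VALUE)
        return 1;

    // Reader thread: fill 'block' with kChunk-sized ReadFile calls from hSrc.
    // Writer thread: flush each full kBlock to hDst with a single WriteFile call.
    // (Thread creation and synchronization are omitted here.)

    CloseHandle(hSrc);
    CloseHandle(hDst);
    VirtualFree(block, 0, MEM_RELEASE);
    return 0;
}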

I used benchmarking utilities like ATTO and found that our RAID 5 hardware has a write speed of 44 MB per second when writing 8 MB chunks, which is around 2.57 GB per minute.

But my program is able to achieve only 1.4 GB per minute.

Can anyone please help me identify what the problem is? Are there faster APIs available other than CreateFile, ReadFile and WriteFile?

A: 

How fast can you read the source file if you don't write the destination?

Is the source file fragmented? Fragmented reads can be an order of magnitude slower than contiguous reads. You can use the "contig" utility to make it contiguous:

http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx
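For example, assuming the large file sits at D:\staging\bigfile.dat (an illustrative path), running the tool from a command prompt on the source box would look something like:

contig -v D:\staging\bigfile.dat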

How fast is the network connecting the two machines?

Have you tried just writing dummy data, without reading it first, like ATTO does?

Do you have more than one read or write request in flight at a time?

What's the stripe size of your RAID-5 array? Writing a full stripe at a time is the fastest way to write to RAID-5.

RickNZ
A: 

If write speed is that important, why not consider RAID 0 for your hardware configuration?

Paul Sasik
A: 

Just remember that a hard disk buffers data coming from the platters and going to the platters. Most disk drives will try to optimize the read requests to keep the platters rotating and minimize head movement. The drives try to absorb as much data from the host as possible before writing to the platters, so that the host can be disconnected as soon as possible.

Your performance also depends on the I/O bus traffic on the PC as well as the traffic between the disk and the host. There are other factors to consider, such as system tasks and programs running "at the same time". You may not be able to achieve exactly the performance of your measuring tool, and remember that these timings have an error factor due to the above-mentioned overheads.

If your platform has DMA controllers, try using these.

Thomas Matthews
A: 

How fast can you read the source file if you don't write the destination? - The source is not a RAID machine. Oh, I totally missed this point. I think I'll benchmark the source using ATTO, for it could actually be the bottleneck.

Is the source file fragmented? Fragmented reads can be an order of magnitude slower than contiguous reads. You can use the "contig" utility to make it contiguous: - Okay, I'll use "contig" and perform the test again.

How fast is the network connecting the two machines? - 1 Gbps network. Sorry, I forgot to mention this in the initial question.

Have you tried just writing dummy data, without reading it first, like ATTO does? - No, I haven't done that. Actually, I gave it a thought myself but never tried it. I think I'll try that.

Do you have more than one read or write request in flight at a time? - I didn't understand this question. Are you referring to multi-threaded reads/writes?

  • The program (that I've written) runs on the target machine.
  • The reader thread reads a 256 KB chunk from the source into a buffer.
  • The writer thread waits until 32 such 256 KB chunks are filled (8 MB).
  • When the writer gets the signal, it writes the 8 MB of data in one chunk.
  • The program allocates 16 such 8 MB blocks at the beginning.
  • In the majority of cases reads are faster, and hence the writer is never blocked.
  • The buffer used is circular, and synchronization is in place.
  • We've tested with different chunk sizes (64, 128, 256 KB), chunks per block (8, 16, 32, 64) and numbers of blocks (8, 16, 32, 64, 128). It turns out that on the current RAID 5 hardware, 256 KB chunk reads from the source and 8 MB writes at the target are the fastest.

What's the stripe size of your RAID-5 array? Writing a full stripe at a time is the fastest way to write to RAID-5. - How can I find this out? Actually, the RAID was configured by the help-desk engineers, but I think it's a 4-disk RAID 5 with 1 disk as hot spare.

ring0
I meant, do you start one read or write before the last one completes? If so, be careful about causing disk seeks; disk I/O is fastest when it's linear. Regarding stripe size: RAID 5 operates in two modes: small writes and large writes. Small writes are less than the stripe size; in that case, the parity block has to be read first, before it's written. With large writes, which are the same size as the stripe, the parity block is just written, without being read first, so they are much faster. The stripe size is the strip size times (the number of disks minus one), and it is set at configuration time.
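To put hypothetical numbers on that: if the array were configured with a 64 KB strip size and had 4 active disks, a full stripe of data would be 64 KB × (4 − 1) = 192 KB, so writes sized and aligned to multiples of 192 KB would avoid the parity read-modify-write penalty.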
RickNZ
The reads and writes happen in 2 threads, but all reads happen sequentially and all writes happen sequentially as well. The reader fills up a buffer by sequentially reading from the source file. The writer writes the data into a pre-extended target file; the writes are sequential as well. How can I find out the strip size?
ring0
A: 

If write speed is that important, why not consider RAID 0 for your hardware configuration?

  • The customer wants RAID 5.
  • Preferred over RAID 0 because of better fault tolerance.
  • The customer is satisfied with what RAID 5 can offer. The question here is: benchmarking the hardware using ATTO shows a write speed of 2.57 GB per minute (8 MB chunk writes), so why can't a copy tool come close to it? Something like 2 GB per minute is what we are looking for. We've been able to achieve only ~1.5 GB per minute so far.
ring0
A: 

The right way to do this is with unbuffered, fully asynchronous I/O. You will want to issue multiple I/Os to keep a queue going. This lets the file system, driver, and RAID 5 subsystem manage the I/Os more optimally.

You can also open multiple files and issue reads and writes to multiple files.

NOTE! The optimal number of outstanding I/Os and how you interleave reads and writes will depend greatly on the storage subsystem itself. Your program will need to be highly parameterized so you can tune it.
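For illustration only, here is a minimal sketch of what keeping several unbuffered, overlapped writes in flight could look like; the queue depth, block size and path are made-up tuning parameters, not recommendations:

#include <windows.h>

int main()
{
    const int   QUEUE_DEPTH = 4;                 // outstanding writes to keep in flight
    const DWORD BLOCK       = 8 * 1024 * 1024;   // bytes per write (a tuning knob)

    // Hypothetical target path; FILE_FLAG_OVERLAPPED enables asynchronous completion.
    // (Pre-extending the file first, as a later answer notes, keeps these writes
    // truly asynchronous.)
    HANDLE hFile = CreateFileA("D:\\target\\out.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS,
                               FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov[QUEUE_DEPTH] = {};
    void*      buf[QUEUE_DEPTH] = {};
    for (int i = 0; i < QUEUE_DEPTH; ++i) {
        // Manual-reset events, initially signalled so the first pass doesn't block.
        ov[i].hEvent = CreateEvent(NULL, TRUE, TRUE, NULL);
        buf[i] = VirtualAlloc(NULL, BLOCK, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    }

    LONGLONG offset = 0;
    for (int n = 0; n < 64; ++n) {               // e.g. 64 blocks = 512 MB
        int slot = n % QUEUE_DEPTH;
        // Wait until this slot's previous write has completed before reusing it.
        WaitForSingleObject(ov[slot].hEvent, INFINITE);

        // ... fill buf[slot] with the next BLOCK bytes of source data here ...

        ov[slot].Offset     = (DWORD)(offset & 0xFFFFFFFF);
        ov[slot].OffsetHigh = (DWORD)(offset >> 32);
        if (!WriteFile(hFile, buf[slot], BLOCK, NULL, &ov[slot]) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;                            // real code would report the error
        offset += BLOCK;
    }

    // Drain whatever is still in flight before closing the handle.
    for (int i = 0; i < QUEUE_DEPTH; ++i)
        WaitForSingleObject(ov[i].hEvent, INFINITE);

    for (int i = 0; i < QUEUE_DEPTH; ++i)
        VirtualFree(buf[i], 0, MEM_RELEASE);
    CloseHandle(hFile);
    return 0;
}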

Note: I believe that Robocopy has been improved; have you tried it?

Foredecker
One goal is to minimize disk head movement. If you open multiple files or have non-sequential I/O requests in flight, disk head seeks can cause I/O throughput to drop by an order of magnitude or more.
RickNZ
Actually that's not true at all. Opening several files first gets that I/O out of the way. It also lets the file system load the directory entries more efficiently. The OS will generally do better when it has a queue of I/O requests to work with.
Foredecker
A: 

First, did you compare it with baseline values: "creating an empty 2 GB file" and "just reading a 2 GB file"? Are you sure the problem is with "copying" and not with "writing" or "reading" in general?

Assuming you already did your benchmarks, I suggest you stop doing your own buffering (think about why you disabled OS buffering in the first place) and start using overlapped I/O for writing, with alternating read buffers. The pseudo-code looks like this:

buf1 = alloc(BUFSIZE);
buf2 = alloc(BUFSIZE);
buf = buf1;
h_over = NULL;
while(!eof) {
  read(buf, BUFSIZE);              // read the next chunk into the idle buffer
  if(h_over) waitfor(h_over);      // wait for the previous overlapped write
  h_over = write_overlapped(buf, BUFSIZE);
  buf = buf == buf1 ? buf2 : buf1; // swap buffers
}
if(h_over) waitfor(h_over);        // wait for the final write to complete

Here are the key points:

  • Do not bite off more than you can chew. If the writer is slower than the reader, it's pointless to buffer in large sizes (such as 2 GB in your example).

  • It still makes sense to start reading while writing, though. Overlapped I/O takes care of this beautifully. Another thread works too, but the synchronization is a chore and unnecessarily complicated.

ssg
+3  A: 

You should use async I/O to get the best performance. That is, opening the file with FILE_FLAG_OVERLAPPED and using the LPOVERLAPPED argument of WriteFile. You may or may not get better performance with FILE_FLAG_NO_BUFFERING. You will have to test to see.

FILE_FLAG_NO_BUFFERING will generally give you more consistent speeds and better streaming behavior, and it avoids polluting your disk cache with data that you may not need again, but it isn't necessarily faster overall.

You should also test to see what the best size is for each block of I/O. In my experience there is a huge performance difference between copying a file 4 KB at a time and copying it 1 MB at a time.

In my past testing of this (a few years ago) I found that block sizes below about 64 KB were dominated by overhead, and total throughput continued to improve with larger block sizes up to about 512 KB. I wouldn't be surprised if with today's drives you needed to use block sizes larger than 1 MB to get maximum throughput.

The numbers you are currently using appear to be reasonable, but may not be optimal. Also I'm fairly certain that FILE_FLAG_WRITE_THROUGH prevents the use of the on-disk cache and thus will cost you a fair bit of performance.

You need to also be aware that copying files using CreateFile/WriteFile will not copy metadata such as timestamps or alternate data streams on NTFS. You will have to deal with these things on your own.

Actually replacing CopyFile with your own code is quite a lot of work.

Addendum:

I should probably mention that when I tried this with software RAID 0 on Windows NT 3.0 (about 10 years ago), the speed was VERY sensitive to the in-memory alignment of the buffers. It turned out that, at the time, the SCSI drivers had to use a special algorithm for doing DMA from a scatter/gather list when the DMA covered more than 16 physical regions of memory (64 KB). Guaranteed optimal performance required physically contiguous allocations, which is something that only drivers can request. This was basically a workaround for a bug in the DMA controller of a popular chipset back then, and is unlikely to still be an issue.

BUT I would still strongly suggest that you test ALL power-of-2 block sizes from 32 KB to 32 MB to see which is fastest. And you might consider testing whether some buffers are consistently faster than others; it's not unheard of.
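Here is a rough sketch (with a made-up path and a 512 MB test size) of what such a block-size sweep could look like, timing synchronous unbuffered writes with QueryPerformanceCounter:

#include <windows.h>
#include <stdio.h>

// Hypothetical harness for sweeping power-of-2 block sizes, as suggested above.
// The file name and total size are placeholders; real code should check errors.
int main()
{
    const LONGLONG TOTAL = 512LL * 1024 * 1024;          // write 512 MB per test
    for (DWORD block = 32 * 1024; block <= 32 * 1024 * 1024; block *= 2) {
        HANDLE h = CreateFileA("D:\\target\\probe.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_FLAG_NO_BUFFERING, NULL);
        void* buf = VirtualAlloc(NULL, block, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        LARGE_INTEGER freq, t0, t1;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);

        DWORD written;
        for (LONGLONG done = 0; done < TOTAL; done += block)
            WriteFile(h, buf, block, &written, NULL);

        QueryPerformanceCounter(&t1);
        double secs = (double)(t1.QuadPart - t0.QuadPart) / freq.QuadPart;
        printf("block %7lu KB: %.1f MB/s\n", block / 1024,
               (TOTAL / (1024.0 * 1024.0)) / secs);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
    }
    return 0;
}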

John Knoeller
+1. For asynchronous I/O.
wj32
I haven't tried asynchronous I/O yet; I have to try it. BTW, after some read/write tests I figured out that 256 KB reads and 8 MB writes were yielding maximum throughput. I wrote a reader program to check the read speed, and used ATTO and another custom-written program to test the write throughput.
ring0
A: 

A while back I wrote a blog posting about async file I/O and how it often tends to actually end up being synchronous unless you do everything just right (http://www.lenholgate.com/archives/000765.html).

The key points are that even when you're using FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING you still need to pre-extend the file so that your async writes don't have to extend it as they go; for security reasons, file extension is always synchronous. To pre-extend you need to do the following (a rough code sketch follows these steps):

  • Enable the SE_MANAGE_VOLUME_NAME privilege.
  • Open the file.
  • Seek to the desired file length with SetFilePointerEx().
  • Set the end of file with SetEndOfFile().
  • Set the end of the valid data within the file with SetFileValidData().
  • Close the file.

Then...

  • Open the file to write.
  • Issue the writes.
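Here is a rough sketch of those steps (the path and size are placeholders, most error checking is omitted, and the account must already hold SeManageVolumePrivilege for the privilege adjustment to succeed):

#include <windows.h>

// Sketch of the pre-extension sequence listed above.
bool PreExtend(const char* path, LONGLONG bytes)
{
    // 1. Enable SE_MANAGE_VOLUME_NAME ("SeManageVolumePrivilege") on our token.
    HANDLE hToken;
    OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &hToken);
    TOKEN_PRIVILEGES tp = {};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    LookupPrivilegeValueA(NULL, "SeManageVolumePrivilege", &tp.Privileges[0].Luid);
    AdjustTokenPrivileges(hToken, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(hToken);

    // 2. Open the file, 3. seek to the target length, 4. set end-of-file,
    // 5. mark the data as valid, 6. close.
    HANDLE h = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER size;
    size.QuadPart = bytes;
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);
    SetEndOfFile(h);
    BOOL ok = SetFileValidData(h, bytes);   // fails if the privilege is missing
    CloseHandle(h);
    return ok != FALSE;
}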
Len Holgate
Actually I am pre-extending the file. As I am using NO_BUFFERING, I'm taking care of the data alignment issues. For example, to copy a 1027 KB file: 1) I create a target file of 1024 KB using SetFilePointerEx and SetEndOfFile (1024 KB because of alignment considerations). 2) Start the copy. 3) After 1024 KB is copied, I close the target file handle, re-open it without the NO_BUFFERING flag, seek to the appropriate offset using SetFilePointerEx, and then issue the WriteFile, which automatically grows the file to 1027 KB. I haven't read your blog yet; I'll do that and get back to you.
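For what it's worth, a small sketch of step 3's tail handling (the names, the 1024 KB aligned length and the 3 KB tail are just the example figures from this comment):

#include <windows.h>

// Once the aligned 1024 KB portion has been copied with FILE_FLAG_NO_BUFFERING,
// the remaining unaligned tail is appended through an ordinary buffered handle.
bool WriteUnalignedTail(const char* path, const void* tail, DWORD tailBytes,
                        LONGLONG alignedLength)
{
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL); // no NO_BUFFERING
    if (h == INVALID_HANDLE_VALUE) return false;

    LARGE_INTEGER pos;
    pos.QuadPart = alignedLength;               // e.g. 1024 KB
    SetFilePointerEx(h, pos, NULL, FILE_BEGIN);

    DWORD written = 0;
    BOOL ok = WriteFile(h, tail, tailBytes, &written, NULL);  // e.g. the last 3 KB
    CloseHandle(h);
    return ok && written == tailBytes;
}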
ring0
A: 

I did some tests and have some results. The tests were performed on 100 Mbps & 1 Gbps NICs. The source machine is a Win2K3 server (SATA) and the target machine is a Win2K3 server (RAID 5).

I ran 3 tests:

1) Network Reader -> This program just reads files across the network. The purpose of the program is to find the maximum network read speed. I am performing non-buffered reads using CreateFile & ReadFile.

2) Disk Writer -> This program benchmarks the RAID 5 speed by writing data. Non-buffered writes are performed using CreateFile & WriteFile.

3) Blitz Copy -> This program is the file copy engine. It copies files across the network. The logic of this program was discussed in the initial question. I am using synchronous I/O with NO_BUFFERING reads & writes. The APIs used are CreateFile, ReadFile & WriteFile.


Below are the results:

NETWORK READER:-

100 Mbps NIC

Took 148344 ms to read 768 MB with chunk size 8 KB.

Took 89359 ms to read 768 MB with chunk size 64 KB

Took 82625 ms to read 768 MB with chunk size 128 KB

Took 79594 ms to read 768 MB with chunk size 256 KB

Took 78687 ms to read 768 MB with chunk size 512 KB

Took 79078 ms to read 768 MB with chunk size 1024 KB

Took 78594 ms to read 768 MB with chunk size 2048 KB

Took 78406 ms to read 768 MB with chunk size 4096 KB

Took 78281 ms to read 768 MB with chunk size 8192 KB

1 Gbps NIC

Took 206203 ms to read 5120 MB (5GB) with chunk size 8 KB

Took 77860 ms to read 5120 MB with chunk size 64 KB

Took 74531 ms to read 5120 MB with chunk size 128 KB

Took 68656 ms to read 5120 MB with chunk size 256 KB

Took 64922 ms to read 5120 MB with chunk size 512 KB

Took 66312 ms to read 5120 MB with chunk size 1024 KB

Took 68688 ms to read 5120 MB with chunk size 2048 KB

Took 64922 ms to read 5120 MB with chunk size 4096 KB

Took 66047 ms to read 5120 MB with chunk size 8192 KB

DISK WRITER:-

Writes performed on RAID 5 with NO_BUFFERING & WRITE_THROUGH

Writing 2048MB (2GB) of data with chunk size 4MB took 68328ms.

Writing 2048MB of data with chunk size 8MB took 55985ms.

Writing 2048MB of data with chunk size 16MB took 49569ms.

Writing 2048MB of data with chunk size 32MB took 47281ms.

Writes performed on RAID 5 with NO_BUFFERING only

Writing 2048MB (2GB) of data with chunk size 4MB took 57484ms.

Writing 2048MB of data with chunk size 8MB took 52594ms.

Writing 2048MB of data with chunk size 16MB took 49125ms.

Writing 2048MB of data with chunk size 32MB took 46360ms.

Write performance degrades linearly as the chunk size is reduced, and the WRITE_THROUGH flag introduces some performance hit.

BLITZ COPY:-

1 Gbps NIC, Copying 60 GB of files with NO_BUFFERING

Time taken to complete the copy: 2236735 ms, i.e., 37.2 minutes. The speed is ~97 GB per hour.

100 Mbps NIC, Copying 60 GB of files with NO_BUFFERING

Time taken to complete the copy: 7337219 ms, i.e., 122 minutes. The speed is ~30 GB per hour.

I did try using the 10-FileCopy program by Jeffrey Richter that uses async I/O with NO_BUFFERING, but the results were poor. I guess the reason could be that the chunk size is 256 KB; 256 KB writes on RAID 5 are terribly slow.

Comparing with robocopy:

100 Mbps NIC: Blitz Copy and robocopy both perform at ~30 GB per hour.

1 Gbps NIC: Blitz Copy goes at ~97 GB per hour while robocopy manages ~50 GB per hour.

ring0