I've been trying to find the fastest way to code a file copy routine to copy a large file onto RAID 5 storage.
The average file size is around 2 GB.
There are 2 Windows boxes (both running Win2k3). The first box is the source, where the large file is located, and the second box has the RAID 5 storage.
http://blogs.technet.com/askperf/archive/2007/05/08/slow-large-file-copy-issues.aspx
The above link clearly explains why the built-in Windows copy, Robocopy and other common copy utilities suffer poor write performance.
Hence, I've written a C/C++ program that uses the CreateFile, ReadFile and WriteFile APIs with the NO_BUFFERING and WRITE_THROUGH flags. The program simulates ESEUTIL.exe in the sense that it uses 2 threads, one for reading and one for writing. The reader thread reads 256 KB from the source and fills a buffer. Once 32 such 256 KB chunks are filled (8 MB in total), the writer thread writes the contents of the buffer to the destination file. As you can see, the writer thread writes 8 MB of data in one shot. The program allocates 32 such 8 MB blocks, so reading and writing can happen in parallel.
Details of ESEUtil.exe can be found in the above link.
Note: I am taking care of the data alignment issues when using NO_BUFFERING.
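For reference, NO_BUFFERING requires file offsets and transfer sizes to be multiples of the volume sector size, and buffer addresses to be sector-aligned too. A small sketch of the rounding arithmetic (the 512-byte sector size is an assumption here; a real program would query the actual volume sector size, e.g. with GetDiskFreeSpace, and allocate aligned buffers with something like _aligned_malloc):

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sector size; must be a power of two for the bit trick below.
constexpr std::size_t kSectorSize = 512;

// Round a transfer size up to the next sector-size multiple.
constexpr std::size_t round_up_to_sector(std::size_t n) {
    return (n + kSectorSize - 1) & ~(kSectorSize - 1);
}

// Check whether a buffer address meets the alignment requirement.
inline bool is_sector_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kSectorSize == 0;
}
```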
I used benchmarking utilities like ATTO and found that our RAID 5 hardware has a write speed of 44 MB per second when writing 8 MB chunks, which is around 2.57 GB per minute.
But my program is able to achieve only 1.4 GB per minute.
Can anyone please help me identify the problem? Are there faster APIs than CreateFile, ReadFile and WriteFile available?
You should use async I/O to get the best performance. That means opening the file with FILE_FLAG_OVERLAPPED and using the LPOVERLAPPED argument of WriteFile. You may or may not get better performance with FILE_FLAG_NO_BUFFERING; you will have to test to see.
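A minimal Win32 sketch of a single overlapped write (compiles only on Windows; the helper name and parameters are mine, error handling is pared down, and a real copy loop would keep several OVERLAPPED writes in flight at different file offsets instead of waiting immediately):

```cpp
#include <windows.h>

// Issue one write through FILE_FLAG_OVERLAPPED: WriteFile returns at once
// with ERROR_IO_PENDING and the program is free to prepare the next buffer
// while the disk works.
bool overlapped_write(const wchar_t* path, const char* data, DWORD size) {
    HANDLE h = CreateFileW(path, GENERIC_WRITE, 0, nullptr, CREATE_ALWAYS,
                           FILE_FLAG_OVERLAPPED, nullptr);
    if (h == INVALID_HANDLE_VALUE) return false;

    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);
    ov.Offset = 0;  // absolute file offset (low 32 bits)

    BOOL ok = WriteFile(h, data, size, nullptr, &ov);
    if (!ok && GetLastError() == ERROR_IO_PENDING) {
        // The write is in flight; here we simply block until it finishes.
        DWORD written = 0;
        ok = GetOverlappedResult(h, &ov, &written, TRUE /*wait*/);
    }
    CloseHandle(ov.hEvent);
    CloseHandle(h);
    return ok == TRUE;
}
```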
FILE_FLAG_NO_BUFFERING will generally give you more consistent speeds and better streaming behavior, and it avoids polluting your disk cache with data that you may not need again, but it isn't necessarily faster overall.
You should also test to find the best size for each block of I/O. In my experience, there is a huge performance difference between copying a file 4 KB at a time and copying it 1 MB at a time.
In my past testing of this (a few years ago) I found that block sizes below about 64 KB were dominated by overhead, and total throughput continued to improve with larger block sizes up to about 512 KB. I wouldn't be surprised if with today's drives you needed block sizes larger than 1 MB to get maximum throughput.
The numbers you are currently using appear to be reasonable, but may not be optimal. Also I'm fairly certain that FILE_FLAG_WRITE_THROUGH prevents the use of the on-disk cache and thus will cost you a fair bit of performance.
You need to also be aware that copying files using CreateFile/WriteFile will not copy metadata such as timestamps or alternate data streams on NTFS. You will have to deal with these things on your own.
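For example, here is a hedged Win32 sketch of carrying just the timestamps across after a manual copy (helper name is mine; alternate data streams would need separate handling, e.g. via BackupRead):

```cpp
#include <windows.h>

// Copy creation/access/write times from src to dst, one of the pieces
// CopyFile would otherwise handle for you.
bool copy_timestamps(const wchar_t* src, const wchar_t* dst) {
    HANDLE hs = CreateFileW(src, GENERIC_READ, FILE_SHARE_READ, nullptr,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE hd = CreateFileW(dst, FILE_WRITE_ATTRIBUTES, 0, nullptr,
                            OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (hs == INVALID_HANDLE_VALUE || hd == INVALID_HANDLE_VALUE) {
        if (hs != INVALID_HANDLE_VALUE) CloseHandle(hs);
        if (hd != INVALID_HANDLE_VALUE) CloseHandle(hd);
        return false;
    }
    FILETIME created, accessed, written;
    BOOL ok = GetFileTime(hs, &created, &accessed, &written) &&
              SetFileTime(hd, &created, &accessed, &written);
    CloseHandle(hs);
    CloseHandle(hd);
    return ok == TRUE;
}
```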
Actually, replacing CopyFile with your own code is quite a lot of work.
Addendum:
I should probably mention that when I tried this with software RAID 0 on Windows NT 3.0 (about 10 years ago), the speed was VERY sensitive to the alignment in memory of the buffers. It turned out that at the time, the SCSI drivers had to use a special algorithm for doing DMA from a scatter/gather list when the DMA covered more than 16 physical regions of memory (64 KB). Getting guaranteed optimal performance required physically contiguous allocations, which is something that only drivers can request. This was basically a workaround for a bug in the DMA controller of a popular chipset back then, and is unlikely to still be an issue.
BUT - I would still strongly suggest that you test ALL power-of-2 block sizes from 32 KB to 32 MB to see which is fastest. And you might consider testing to see if some buffers are consistently faster than others - it's not unheard of.
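A sketch of that sweep, using standard C++ streams as a portable stand-in for the Win32 calls (the absolute numbers will differ from an unbuffered Win32 run; the shape of the throughput curve across block sizes is what you're after, and function names here are mine):

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Write `total` bytes to `path` in chunks of `block`, returning MB/s.
// For simplicity, assumes `total` is a multiple of `block`.
double write_throughput(const std::string& path, std::size_t block,
                        std::size_t total) {
    std::vector<char> buf(block, 'x');
    auto t0 = std::chrono::steady_clock::now();
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    for (std::size_t done = 0; done < total; done += block)
        out.write(buf.data(), static_cast<std::streamsize>(buf.size()));
    out.flush();
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return (total / (1024.0 * 1024.0)) / secs;
}

// Sweep the power-of-2 block sizes suggested above, e.g.:
// for (std::size_t b = 32 * 1024; b <= 32u * 1024 * 1024; b *= 2)
//     record write_throughput("bench.bin", b, /*total=*/512u * 1024 * 1024);
```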