Why is dd with the 'direct' (O_DIRECT) flag so dramatically faster?

Question 1

linux file io dd

Joseph Garvin · Nov 2, 2015 · Viewed 19.4k times · Source

Answer

In the oflag=direct case:

You are giving the kernel the ability to write data out straight away rather than filling a buffer and waiting for a threshold/timeout to be hit (which in turn means that data is less likely to be held up behind a sync of unrelated data).
You are saving the kernel work (no extra copies from userland to the kernel, no need to perform most buffer cache management operations).
In some cases, dirtying buffers faster than they can be flushed will result in the program generating the dirty buffers being made to wait until pressure on arbitrary limits is relieved (see SUSE's "Low write performance on SLES 11/12 servers with large RAM").
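To make the points above concrete, here is a minimal sketch of roughly what dd does per block when oflag=direct is set: open the file with O_DIRECT and issue one write() per block from an aligned buffer. The function name write_blocks and the fallback to buffered I/O are assumptions made for illustration (some filesystems reject O_DIRECT), not how dd itself is implemented.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 1024 * 1024  # 1 MiB, matching bs=1M in the dd commands


def write_blocks(path, count, direct=True):
    """Write `count` zero-filled blocks to `path` (hypothetical helper).

    Returns (bytes_written, used_direct). Falls back to buffered I/O when
    the platform or filesystem rejects O_DIRECT.
    """
    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    attempts = [base]
    if direct and hasattr(os, "O_DIRECT"):
        attempts.insert(0, base | os.O_DIRECT)

    # An anonymous mmap is page-aligned and zero-filled, so it meets the
    # usual O_DIRECT alignment rules and mimics reading from /dev/zero.
    buf = mmap.mmap(-1, BLOCK_SIZE)
    last_error = None
    try:
        for flags in attempts:
            try:
                fd = os.open(path, flags, 0o644)
            except OSError as exc:  # e.g. EINVAL where O_DIRECT is unsupported
                last_error = exc
                continue
            try:
                written = 0
                for _ in range(count):
                    written += os.write(fd, buf)  # one write() per block
                return written, flags != base
            except OSError as exc:
                last_error = exc
            finally:
                os.close(fd)
        raise last_error
    finally:
        buf.close()


# Small demo: 4 MiB into a temporary directory (not a benchmark).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "ddfile2")
    written, used_direct = write_blocks(target, count=4)
    size_on_disk = os.path.getsize(target)
print(f"wrote {written} bytes, O_DIRECT used: {used_direct}")
```

The page-aligned buffer matters: with O_DIRECT the kernel generally transfers straight from the user buffer, so an unaligned buffer or odd length is typically rejected rather than silently copied.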

More generally, that giant block size (1 MByte) is likely big enough that the coalescing you get from buffered writeback with tiny I/Os won't be worth much. The exact point at which the kernel starts splitting I/Os depends on a number of factors. Further, while RAID stripe sizes can be larger than 1 MByte, the kernel isn't always aware of this for hardware RAID. In the case of software RAID the kernel can sometimes optimize for stripe size: for example, the kernel I'm on knows the md0 device has a 4 MByte stripe size and expresses a hint that it prefers I/O of that size via /sys/block/md0/queue/optimal_io_size.
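If you want to check what hint your own kernel exposes, the sysfs file mentioned above can be read directly. This is a small sketch; the helper name queue_hint is made up for illustration, and md0 is simply the device from my machine, so substitute your own.

```python
from pathlib import Path


def queue_hint(device, name="optimal_io_size", sysfs_root="/sys/block"):
    """Return a block-queue hint in bytes, or None if it is not exposed.

    `queue_hint` is a hypothetical helper; it only reads
    <sysfs_root>/<device>/queue/<name>.
    """
    path = Path(sysfs_root) / device / "queue" / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


# "md0" is the device name used in this answer; yours will likely differ.
print(queue_hint("md0"))
```

As far as I know a value of 0 means the device reported no preference, so treat 0 and None the same way when picking a block size.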

Given all the above, IF you were maxing out a single CPU during the original buffered copy AND your workload doesn't benefit much from caching/coalescing BUT the disk could handle more throughput THEN doing the O_DIRECT copy should go faster as there's more CPU time available for userspace/servicing disk I/Os due to the reduction in kernel overhead.

So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache?

It's not just an extra memcpy per I/O that is involved - think about all the extra cache machinery that must be maintained. There is a nice explanation of how copying a buffer to the kernel isn't instantaneous, and of how page pressure can slow things down, in an answer to the question "Linux async (io_submit) write v/s normal (buffered) write". However, unless your program can generate data fast enough AND the CPU is so overloaded it can't feed the disk quickly enough, it usually doesn't show up or matter.

Is this atypical?

No, your result is quite typical with the sort of workload you were using. I'd imagine it would be a very different outcome if the blocksize were tiny (e.g. 512 bytes) though.
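To get a feel for why a tiny block size changes the picture, it helps to count how many write() calls the same transfer needs. The sketch below only does that arithmetic for the 2047868928 byte transfer from the question; the helper name writes_needed is illustrative, and it says nothing about how long each call takes.

```python
TOTAL_BYTES = 2047868928  # size of the dd transfer in the question


def writes_needed(total_bytes, block_size):
    """Number of write() calls for a given block size (ceiling division)."""
    return -(-total_bytes // block_size)


for label, bs in (("512 B", 512), ("4 KiB", 4096), ("1 MiB", 1024 * 1024)):
    print(f"bs={label}: {writes_needed(TOTAL_BYTES, bs)} writes")
```

With direct I/O each of those calls has to be serviced by the device with no cache in between to batch them up, which is where I'd expect buffering to start winning.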

Let's compare some of fio's output to help us understand this:

$ fio --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M_no_fsync
buffered_1M_no_fsync: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2511MiB/s][r=0,w=2510 IOPS][eta 00m:00s]
buffered_1M_no_fsync: (groupid=0, jobs=1): err= 0: pid=25408: Sun Aug 25 09:10:31 2019
  write: IOPS=2100, BW=2100MiB/s (2202MB/s)(20.0GiB/9752msec)
[...]
  cpu          : usr=2.08%, sys=97.72%, ctx=114, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%

So using buffering we wrote at about 2.1 GBytes/s but used up a whole CPU to do so. However, the block device (md0) says it barely saw any I/O (ios=0/3 - only three write I/Os), which likely means most of the I/O was cached in RAM! As this particular machine could easily buffer 20 GBytes in RAM, we shall do another run with end_fsync=1. This forces any data still held only in the kernel's RAM cache at the end of the run to be pushed to disk, ensuring we record the time it took for all the data to actually reach non-volatile storage:

$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M
buffered_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]      
buffered_1M: (groupid=0, jobs=1): err= 0: pid=41884: Sun Aug 25 09:13:01 2019
  write: IOPS=1928, BW=1929MiB/s (2023MB/s)(20.0GiB/10617msec)
[...]
  cpu          : usr=1.77%, sys=97.32%, ctx=132, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/40967, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/2561, aggrmerge=0/2559, aggrticks=0/132223, aggrin_queue=127862, aggrutil=21.36%

OK, now the speed has dropped to about 1.9 GBytes/s and we still use a whole CPU, but the disks in the RAID device claim they had capacity to go faster (aggrutil=21.36%). Next up, direct I/O:

$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_1M 
direct_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=3242MiB/s][r=0,w=3242 IOPS][eta 00m:00s]
direct_1M: (groupid=0, jobs=1): err= 0: pid=75226: Sun Aug 25 09:16:40 2019
  write: IOPS=2252, BW=2252MiB/s (2361MB/s)(20.0GiB/9094msec)
[...]
  cpu          : usr=8.71%, sys=38.14%, ctx=20621, majf=0, minf=83
[...]
Disk stats (read/write):
    md0: ios=0/40966, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/5120, aggrmerge=0/0, aggrticks=0/1283, aggrin_queue=1, aggrutil=0.09%

Going direct we use just under 50% of a CPU to do 2.2 GBytes/s (but notice how I/Os weren't merged and how we did far more userspace/kernel context switches). If we were to push more I/O per syscall things change:

$ fio --bs=4M --size=20G --rw=write --filename=zeroes --name=buffered_4M_no_fsync
buffered_4M_no_fsync: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2390MiB/s][r=0,w=597 IOPS][eta 00m:00s]
buffered_4M_no_fsync: (groupid=0, jobs=1): err= 0: pid=8029: Sun Aug 25 09:19:39 2019
  write: IOPS=592, BW=2370MiB/s (2485MB/s)(20.0GiB/8641msec)
[...]
  cpu          : usr=3.83%, sys=96.19%, ctx=12, majf=0, minf=1048
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%

$ fio --end_fsync=1 --bs=4M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_4M
direct_4M: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=5193MiB/s][r=0,w=1298 IOPS][eta 00m:00s]
direct_4M: (groupid=0, jobs=1): err= 0: pid=92097: Sun Aug 25 09:22:39 2019
  write: IOPS=866, BW=3466MiB/s (3635MB/s)(20.0GiB/5908msec)
[...]
  cpu          : usr=10.02%, sys=44.03%, ctx=5233, majf=0, minf=12
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%

With a massive block size of 4 MBytes, buffered I/O became bottlenecked at "just" 2.3 GBytes/s (even when we didn't force the cache to be flushed) because there was no CPU left. Direct I/O used around 55% of a CPU and managed to reach 3.5 GBytes/s, so it was roughly 50% faster than buffered I/O.
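One way to compare the runs on equal footing is CPU time spent per GiB written. The sketch below takes the usr/sys percentages and runtimes from four of the fio runs above and assumes (this is my simplification) that those percentages apply to the whole wall-clock runtime of a single job. Note that buffered_4M_no_fsync was not flushed, so its figure flatters buffered I/O.

```python
# name: (usr %, sys %, runtime in ms), copied from the fio output above
RUNS = {
    "buffered_1M": (1.77, 97.32, 10617),
    "direct_1M": (8.71, 38.14, 9094),
    "buffered_4M_no_fsync": (3.83, 96.19, 8641),
    "direct_4M": (10.02, 44.03, 5908),
}
SIZE_GIB = 20  # --size=20G in every run


def cpu_seconds_per_gib(usr_pct, sys_pct, runtime_ms, size_gib=SIZE_GIB):
    """Approximate CPU seconds consumed per GiB written.

    Assumes usr% + sys% is the fraction of wall-clock time on one CPU.
    """
    cpu_seconds = (usr_pct + sys_pct) / 100 * (runtime_ms / 1000)
    return cpu_seconds / size_gib


for name, run in RUNS.items():
    print(f"{name}: {cpu_seconds_per_gib(*run):.3f} CPU-s/GiB")
```

On these numbers the direct runs need well under half the CPU time per GiB of their buffered counterparts, which is the headroom that lets direct I/O pull ahead once the CPU is the bottleneck.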

Summary: Your I/O pattern doesn't really benefit from buffering (I/Os are huge, data is not being reused, I/O is streaming sequential) so you're in an optimal scenario for O_DIRECT being faster. See these slides by the original author of Linux's O_DIRECT (longer PDF document that contains an embedded version of most of the slides) for the original motivation behind it.

Question 2

I have a server with a RAID50 configuration of 24 drives (two groups of 12), and if I run:

dd if=/dev/zero of=ddfile2 bs=1M count=1953 oflag=direct

I get:

2047868928 bytes (2.0 GB) copied, 0.805075 s, 2.5 GB/s

But if I run:

dd if=/dev/zero of=ddfile2 bs=1M count=1953

I get:

2047868928 bytes (2.0 GB) copied, 2.53489 s, 808 MB/s

I understand that O_DIRECT causes the page cache to be bypassed. But as I understand it bypassing the page cache basically means avoiding a memcpy. Testing on my desktop with the bandwidth tool I have a worst case sequential memory write bandwidth of 14GB/s, and I imagine on the newer much more expensive server the bandwidth must be even better. So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache? Is this atypical?
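The back-of-the-envelope reasoning here can be written out explicitly. This sketch assumes the server's memory write bandwidth is no worse than the 14 GB/s measured on the desktop, and that the buffered path costs exactly one extra copy of the data; both are simplifications.

```python
BYTES_COPIED = 2047868928   # from dd's output
DIRECT_SECONDS = 0.805075   # oflag=direct run
BUFFERED_SECONDS = 2.53489  # buffered run
MEMCPY_BANDWIDTH = 14e9     # bytes/s, desktop figure (assumed for the server)

# Time one extra pass over the data would take at that bandwidth.
memcpy_seconds = BYTES_COPIED / MEMCPY_BANDWIDTH
# Extra time the buffered run actually took.
gap_seconds = BUFFERED_SECONDS - DIRECT_SECONDS
# Fraction of the gap a single memcpy could explain.
share = memcpy_seconds / gap_seconds

print(f"extra memcpy: {memcpy_seconds:.3f} s")
print(f"observed gap: {gap_seconds:.3f} s")
print(f"memcpy share of gap: {share:.1%}")
```

Under those assumptions a single extra copy accounts for well under a tenth of the observed gap, which is why the slowdown looks surprising.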
