High iowait with java processes on linux

Per Steffensen · Feb 21, 2012 · Viewed 7.7k times

I have a concurrent system with many machines/nodes involved. Each machine runs several JVMs doing different stuff. It is a "layered" architecture where each layer consists of many JVMs running across the machines. Basically the top-layer JVM receives input from the outside via files, parses the input and sends it on as many small records for "storage" in layer-two. Layer-two doesn't actually persist the data itself; it passes it on to layer-three (HBase and Solr), and HBase in turn doesn't persist it either, since it hands it to layer-four (HDFS) for persistence.

Most of the communication between the layers is synchronous, so of course it ends up with a lot of threads waiting for lower layers to complete. But I would expect those waiting threads to be "free" with respect to CPU usage.

I see a very high iowait (%wa in top) though - something like 80-90% iowait and only 10-20% sys/usr CPU usage. The system seems exhausted - it is slow to log in via ssh and slow to respond to commands, etc.

My question is whether all those JVM threads waiting for lower layers to complete can cause this. Isn't waiting for responses (on sockets) supposed to be "free"? And does it matter, with respect to this, whether the different layers use blocking or non-blocking (NIO) I/O? Exactly in what situations does Linux count something as iowait (%wa in top)? Only when all threads in all JVMs on the machine are waiting (counted because there is no other thread ready to run and do something meaningful in the meantime)? Or do waiting threads also count towards %wa even when there are other processes ready to use the CPU for real processing?
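
For reference, this is roughly how I have been looking at what state the individual java threads are actually in (S = interruptible sleep, e.g. blocked on a socket read; D = uninterruptible sleep, typically waiting for disk/block I/O). The PID used here is just the HRegionServer from the jps output below:

# List every thread with its state and the kernel function it is sleeping in
ps -eLo pid,tid,stat,wchan:32,comm | grep java

# Per-thread view for a single JVM (pressing 'H' in top toggles threads interactively)
top -H -p 19690

# The "b" column counts tasks currently in uninterruptible sleep
vmstat 1 5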

I would really like a thorough explanation of how it works and how to interpret this high %wa. In the beginning I guessed that it only counted as %wa when all threads were waiting, but that there was actually plenty of room for doing more, so I tried to increase the number of threads, expecting to get more throughput, but that doesn't happen. So it is a real problem, not just a "visual" problem when looking at top.

The output below is taken from a machine where only HBase and HDFS are running. It is on machines with HBase and/or HDFS that the problem is showing (most clearly).

--- jps ---
19498 DataNode
19690 HRegionServer
19327 SecondaryNameNode

---- typical top -------
top - 11:13:21 up 14 days, 18:20,  1 user,  load average: 4.83, 4.50, 4.25
Tasks:  99 total,   1 running,  98 sleeping,   0 stopped,   0 zombie
Cpu(s): 14.1%us,  4.3%sy,  0.0%ni,  5.4%id, 74.8%wa,  0.0%hi,  1.3%si,  0.0%st
Mem:   7133800k total,  7099632k used,    34168k free,    55540k buffers
Swap:   487416k total,      248k used,   487168k free,  2076804k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19690 hbase     20   0 4629m 4.2g 9244 S   51 61.7 194:08.84 java
19498 hdfs      20   0 1030m 116m 9076 S   16  1.7  75:29.26 java

---- iostat -kd 1 ----
root@edrxen1-2:~# iostat -kd 1
Linux 2.6.32-29-server (edrxen1-2)      02/22/2012      _x86_64_        (2 CPU)
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              3.53         3.36        15.66    4279502   19973226
dm-0            319.44      6959.14       422.37 8876213913  538720280
dm-1              0.00         0.00         0.00        912        624
xvdb            229.03      6955.81       406.71 8871957888  518747772
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00          0          0
dm-0            122.00      3852.00         0.00       3852          0
dm-1              0.00         0.00         0.00          0          0
xvdb            105.00      3252.00         0.00       3252          0
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvda              0.00         0.00         0.00          0          0
dm-0             57.00      1712.00         0.00       1712          0
dm-1              0.00         0.00         0.00          0          0
xvdb             78.00      2428.00         0.00       2428          0

--- iostat -x ---
Linux 2.6.32-29-server (edrxen1-2)      02/22/2012      _x86_64_        (2 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.06    0.00    3.29   65.14    0.08   23.43
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
xvda              0.00     0.74    0.35    3.18     6.72    31.32    10.78     0.11   30.28   6.24   2.20
dm-0              0.00     0.00  213.15  106.59 13866.95   852.73    46.04     1.29   14.41   2.83  90.58
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     8.00     0.00    5.78   1.12   0.00
xvdb              0.07    86.97  212.73   15.69 13860.27   821.42    64.27     2.44   25.21   3.96  90.47

--- free -o ----
             total       used       free     shared    buffers     cached
Mem:       7133800    7099452      34348          0      55612    2082364
Swap:       487416        248     487168

Answer

user149341 · Feb 22, 2012

IO wait on Linux indicates that processes are blocked in uninterruptible sleep, waiting for I/O to complete. In practice, that almost always means the process is performing disk access -- in this case, I'd guess one of the following:

  • hdfs is performing a lot of disk accesses, and it's making other disk accesses slow as a result. (Checking iostat -x may help, as it'll show an extra "%util" column which indicates what percentage of the time the disk is "busy".)
  • You're running low on system memory under load, and sometimes end up dipping into swap. (A couple of commands for checking both possibilities are sketched below.)
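
If it helps, here is a rough sketch of how I'd check both of those (pidstat is part of the sysstat package, same as iostat):

# Per-device utilization; %util near 100% means the disk is saturated
iostat -x 1 5

# Per-process disk I/O, to confirm it's really the DataNode / HRegionServer doing the reads
pidstat -d 1 5

# Swap activity over time; non-zero si/so columns mean the box is actively swapping
vmstat 1 5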