I am experiencing a weird problem with the Java ProcessBuilder. The code is shown below (in a slightly simplified form):
public class Whatever implements Runnable
{
    public void run(){
        //someIdentifier is a randomly generated string
        String in = someIdentifier + "input.txt";
        String out = someIdentifier + "output.txt";
        ProcessBuilder builder = new ProcessBuilder("./whatever.sh", in, out);
        try {
            Process process = builder.start();
            process.waitFor();
        } catch (IOException e) {
            log.error("Could not launch process. Command: " + builder.command(), e);
        } catch (InterruptedException ex) {
            log.error(ex);
        }
    }
}
whatever.sh reads:
R --slave --args $1 $2 <whatever1.R >> r.log
Loads of instances of Whatever are submitted to an ExecutorService of fixed size (35). The rest of the application waits for all of them to finish, which is implemented with a CountDownLatch. Everything runs fine for several hours (Scientific Linux 5.0, java version "1.6.0_24") before throwing the following exception:
java.io.IOException: Cannot run program "./whatever.sh": java.io.IOException: error=11, Resource temporarily unavailable
at java.lang.ProcessBuilder.start(Unknown Source)
... rest of stack trace omitted...
Does anyone have an idea what this means? Based on the Google/Bing search results for java.io.IOException: error=11, it is not the most common of exceptions and I am completely baffled.
My wild and not so educated guess is that I have too many threads trying to launch the same file at the same time. However, it takes hours of CPU time to reproduce the problem, so I have not tried with a smaller number.
Any suggestions are greatly appreciated.
The error=11 is almost certainly the EAGAIN error code:
$ grep EAGAIN asm-generic/errno-base.h
#define EAGAIN 11 /* Try again */
The clone(2) system call documents an EAGAIN error return:
EAGAIN Too many processes are already running.
The fork(2) system call documents two EAGAIN error returns:
EAGAIN fork() cannot allocate sufficient memory to copy the
parent's page tables and allocate a task structure for
the child.
EAGAIN It was not possible to create a new process because
the caller's RLIMIT_NPROC resource limit was
encountered. To exceed this limit, the process must
have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE
capability.
If you were really that low on memory, it would almost certainly show in the system logs. Check dmesg(1) output or /var/log/syslog (/var/log/messages on Red Hat-style systems such as Scientific Linux) for any messages about low system memory. (Other things would break too, so this doesn't seem too plausible.)
Much more likely is running into either the per-user limit on processes or the system-wide maximum number of processes. Perhaps one of your processes isn't properly reaping zombies? This would be very easy to spot by checking ps(1) output over time:
while true ; do ps auxw >> ~/processes ; sleep 10 ; done
(Maybe check every minute or ten minutes if it really does take hours before you're in trouble.)
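If you would rather log the numbers from inside the Java application, next to your own log lines, the same check can be sketched in Java. This is only a rough sketch under a few assumptions: it relies on the Linux /proc layout described in proc(5), the class name ProcessCensus is made up for illustration, and it sticks to Java 6 language features to match the JVM in the question. It counts every process visible in /proc, not just your own user's.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper (not part of any library): counts processes and zombies via /proc.
final class ProcessCensus {

    /** Returns {total processes, zombie processes}, or {-1, -1} if /proc is unreadable. */
    static int[] count() {
        String[] entries = new File("/proc").list();
        if (entries == null) {
            return new int[] {-1, -1}; // no readable /proc (probably not Linux)
        }
        int total = 0;
        int zombies = 0;
        for (String name : entries) {
            if (!name.matches("\\d+")) {
                continue; // only the all-digit directories are processes
            }
            total++;
            if (readState(name) == 'Z') {
                zombies++;
            }
        }
        return new int[] {total, zombies};
    }

    /** Reads the state letter from /proc/<pid>/stat, or '?' if it cannot be read. */
    private static char readState(String pid) {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader("/proc/" + pid + "/stat"));
            String line = reader.readLine();
            // Format is "pid (comm) state ..."; comm may itself contain ')',
            // so the state letter is located relative to the last ')'.
            int close = (line == null) ? -1 : line.lastIndexOf(')');
            if (close >= 0 && close + 2 < line.length()) {
                return line.charAt(close + 2);
            }
        } catch (IOException e) {
            // The process may have exited between the listing and the read.
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if closing fails.
                }
            }
        }
        return '?';
    }
}
```

Calling ProcessCensus.count() from a scheduled task every minute or so and logging both numbers should show whether the totals creep upward over the hours before the failure.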
If you're not reaping zombies, then read up on whatever you must do to ProcessBuilder to use waitpid(2) to reap your dead children.
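As a starting point, here is a minimal sketch of a launch-and-clean-up helper, assuming the goal is simply that every child is waited for and its pipes are closed before the worker thread picks up the next task. Process.waitFor() is the documented way to wait for the child to exit; how the JVM maps that onto waitpid(2) is an implementation detail I am not asserting here. The class and method names (ProcessCleanup, runAndCleanUp) are made up for illustration, the child's output is simply discarded, and the code sticks to Java 6 features.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not from the question): run a command and release its resources.
final class ProcessCleanup {

    /** Runs the command, waits for it to exit, and returns its exit code. */
    static int runAndCleanUp(String... command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so only one pipe has to be drained.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        try {
            // Assumption: the child does not need input from us, so close its stdin.
            process.getOutputStream().close();
            // Drain the output so the child cannot block on a full pipe.
            InputStream stdout = process.getInputStream();
            byte[] buffer = new byte[4096];
            while (stdout.read(buffer) != -1) {
                // Output is discarded in this sketch; a real caller might log it.
            }
            return process.waitFor();
        } finally {
            // Close the remaining pipes and make sure the child is really gone,
            // even if reading or waiting was interrupted.
            closeQuietly(process.getInputStream());
            closeQuietly(process.getErrorStream());
            process.destroy();
        }
    }

    private static void closeQuietly(Closeable stream) {
        try {
            stream.close();
        } catch (IOException ignored) {
            // Nothing useful to do if closing fails.
        }
    }
}
```

In the run() method from the question, this would stand in for the builder.start() / process.waitFor() pair, for example runAndCleanUp("./whatever.sh", in, out). Whether leaked pipes or unreaped children are actually the cause here is something the ps(1) output should confirm first.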
If you're legitimately running more processes than your rlimits allow, you'll need to use ulimit in your bash(1) scripts (if running as root) or set higher limits in /etc/security/limits.conf for the nproc property.
If you are instead running into the system-wide process limits, you might need to write a larger value into /proc/sys/kernel/pid_max. See proc(5) for some (short) details.