We have an application that runs on Tomcat 6 (6.0.35.0 to be precise), and most of our engineers on Mac OS are having problems starting Tomcat due to the socketAccept call in the Catalina.await method throwing a SocketException:
SEVERE: StandardServer.await: accept:
java.net.SocketException: Invalid argument
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.socketAccept(PlainSocketImpl.java)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
at java.net.ServerSocket.implAccept(ServerSocket.java:522)
at java.net.ServerSocket.accept(ServerSocket.java:490)
at org.apache.catalina.core.StandardServer.await(StandardServer.java:431)
at org.apache.catalina.startup.Catalina.await(Catalina.java:676)
at org.apache.catalina.startup.Catalina.start(Catalina.java:628)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
at mycompany.tomcat.startup.ThreadDumpWrapper.main(ThreadDumpWrapper.java:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.tanukisoftware.wrapper.WrapperStartStopApp.run(WrapperStartStopApp.java:238)
at java.lang.Thread.run(Thread.java:722)
This causes Tomcat to shut down immediately after startup (and no small amount of rage). We think this has been with us for as long as we've run Mac OS with Java 1.7; in the last several months a lot of us have switched to MacBook Pros. Up until now, the only symptom was occasional zero-byte responses from Tomcat, due to this exception also being thrown on a socketRead. Those errors don't hit the logs, so we had each shrugged it off as an isolated problem; we only found the cause when the startup failures began and I set a SocketException breakpoint:
Daemon Thread [http-8080-1] (Suspended (breakpoint at line 47 in SocketException))
SocketException.<init>(String) line: 47
SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]
SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available
SocketInputStream.read(byte[], int, int, int) line: 150
SocketInputStream.read(byte[], int, int) line: 121
InternalInputBuffer.fill() line: 735
InternalInputBuffer.parseRequestLine() line: 366
Http11Processor.process(Socket) line: 814
Http11Protocol$Http11ConnectionHandler.process(Socket) line: 602
JIoEndpoint$Worker.run() line: 489
Thread.run() line: 722
For arguments:
arg0 FileDescriptor (id=499)
fd 1097
useCount AtomicInteger (id=503)
value 2
arg1 (id=502)
arg2 0
arg3 8192
arg4 20000
The problem is time-sensitive. Increased startup time due to application changes (a lot more Spring introspection/singleton overhead) seems to be what causes this to affect Tomcat startup, with the tipping point at about 160 seconds. We can mitigate the problem by disabling some non-mandatory contexts we don't need during development to reduce startup time, but I'd prefer to find the root cause.
The application is far too complex to describe in detail, but I have a hunch that this might relate to an earlier bind, so I'll at least list the listening ports on my machine:
localhost:32000 - Java service wrapper port
*:10001 - RMI registry
*:2322 - Java debug
*:56566 - RMI
*:8180 - Tomcat HTTP connector
*:8543 - Tomcat HTTPS connector
*:2223 - Tomcat Internal HTTP connector (used for cross-server requests)
*:14131 - 'Locking' port to determine if an internal service is running
*:56571 - EhCache RMI
*:56573 - RMI
*:62616 - ActiveMQ broker
*:5001 - SOAPMonitorService
*:8109 - Tomcat shutdown port
Things we've tried so far:
Setting -Djava.net.preferIPv4Stack=true (I've always had that option configured).
Increasing … from 5000 to 10000.
Changing /etc/hosts so the machine hostname resolves to localhost, and configuring JVM options to prefer IPv4 and to not prefer IPv6 addresses (this answer: https://stackoverflow.com/a/16318860/364206).
For those interested in the hosts configuration, it's the same as the default; a quick way to check what the JVM actually resolves is sketched after the hosts file below. I can reproduce this on a Fusion VM with a clean install of 10.8:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
fe80::1%lo0 localhost
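As mentioned above, here's a minimal sketch for checking what the JVM resolves the local hostname to and whether the IPv4/IPv6 properties took effect (the ResolveCheck class name is just for illustration, it's not part of our application):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {
    public static void main(String[] args) throws UnknownHostException {
        // What the local hostname resolves to from the JVM's point of view
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("getLocalHost(): " + local);

        // All addresses registered for that hostname
        // (ordering is affected by the IPv4/IPv6 preference properties)
        for (InetAddress addr : InetAddress.getAllByName(local.getHostName())) {
            System.out.println("  candidate: " + addr);
        }

        // The flags we toggled while troubleshooting
        System.out.println("java.net.preferIPv4Stack = "
                + System.getProperty("java.net.preferIPv4Stack"));
        System.out.println("java.net.preferIPv6Addresses = "
                + System.getProperty("java.net.preferIPv6Addresses"));
    }
}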
Due to the apparent time-sensitive nature of the issue, setting breakpoints to troubleshoot it causes it not to occur. As requested in the comments, I also captured arg0 for SocksSocketImpl(PlainSocketImpl).socketAccept(SocketImpl); nothing seems out of the ordinary.
arg0 SocksSocketImpl (id=460)
address InetAddress (id=465)
canonicalHostName null
holder InetAddress$InetAddressHolder (id=475)
address 0
family 0
hostName null
applicationSetProxy false
closePending false
cmdIn null
cmdOut null
cmdsock null
CONNECTION_NOT_RESET 0
CONNECTION_RESET 2
CONNECTION_RESET_PENDING 1
external_address null
fd FileDescriptor (id=713)
fd -1
useCount AtomicInteger (id=771)
value 0
fdLock Object (id=714)
fdUseCount 0
localport 0
port 0
resetLock Object (id=716)
resetState 0
server null
serverPort 1080
serverSocket null
shut_rd false
shut_wr false
socket Socket (id=718)
bound false
closed false
closeLock Object (id=848)
connected false
created false
impl null
oldImpl false
shutIn false
shutOut false
socketInputStream null
stream false
timeout 0
trafficClass 0
useV4 false
I think all of the threads where the exceptions are thrown are victims of an earlier call, one that doesn't itself result in a SocketException, so I haven't been able to catch it. Being able to start Tomcat by reducing startup time convinces me that the trigger is probably some scheduled task that performs a socket-based operation, which then affects other socket operations.
That still doesn't explain how and why this could affect several threads at once. Whatever we're doing to cause this condition, mysterious SocketExceptions shouldn't bubble up from native code simultaneously on multiple threads: two threads making outgoing web service calls, the Tomcat await call, and several TP processor threads, repeatedly.
Given the generic message, I assumed that an EINVAL error must be returned from one of the system calls in the socketAccept JNI code, so I traced the system calls leading up to the exception; there's no EINVAL returned from any system call. So I went to the OpenJDK sources looking for conditions in the socketAccept code that would set and then throw EINVAL, but I also couldn't find any code that sets errno to EINVAL, or calls NET_ThrowByNameWithLastError, NET_ThrowCurrent or NET_ThrowNew in a way that would throw a SocketException with this default error message.
As for the system calls, we don't even seem to get as far as the accept system call:
PID/THRD RELATIVE ELAPSD CPU SYSCALL(args) = return
6606/0x2c750d: 221538243 5 0 sigprocmask(0x1, 0x0, 0x14D8BE100) = 0x0 0
6606/0x2c750d: 221538244 3 0 sigaltstack(0x0, 0x14D8BE0F0, 0x0) = 0 0
6606/0x2c750d: 221538836 14 10 socket(0x2, 0x1, 0x0) = 1170 0
6606/0x2c750d: 221538837 3 0 fcntl(0x492, 0x3, 0x4) = 2 0
6606/0x2c750d: 221538839 3 1 fcntl(0x492, 0x4, 0x6) = 0 0
6606/0x2c750d: 221538842 5 2 setsockopt(0x492, 0xFFFF, 0x4) = 0 0
6606/0x2c750d: 221538852 7 4 bind(0x492, 0x14D8BE5D8, 0x10) = 0 0
6606/0x2c750d: 221538857 5 2 listen(0x492, 0x1, 0x4) = 0 0
6606/0x2c750d: 221539625 6 2 psynch_cvsignal(0x7FEFBFE00868, 0x10000000200, 0x100) = 257 0
6606/0x2c750d: 221539633 4 1 write(0x2, "Apr 18, 2013 11:05:35 AM org.apache.catalina.core.StandardServer await\nSEVERE: StandardServer.await: accept: \njava.net.SocketException: Invalid argument\n\tat java.net.PlainSocketImpl.socketAccept(Native Method)\n\tat java.net.PlainSocketImpl.socketAcce", 0x644) = 1604 0
So, I think the problem occurs in the timeout handling code at the top of the accept loop in socketAccept, but I couldn't find any case where NET_Timeout would set errno to EINVAL and result in this SocketException being thrown. I'm referring to the socketAccept code in the jdk7u sources; I assume the jdk7u branch is, for the most part, what ships in the Oracle JDK.
I can't find anyone in the outside world affected by this particular problem on Mac OS, but almost everyone here is affected. There must be some application configuration that contributes, but I've exhausted every avenue I can think of to find the root cause.
Pointers on troubleshooting or insight on a possible cause would be much appreciated.
Have you tried turning on JNI debugging with -Xcheck:jni? Interestingly, the Oracle documentation uses a PlainSocketImpl.socketAccept error as an example of when to use this.
Note also that the implication of Bug 7131399 is that the JNI code uses poll() on most platforms but select() on Mac OS, due to a problem with poll() on the Mac. So maybe select() is broken too. Digging in a bit further, select() will return EINVAL if "ndfs is greater than FD_SETSIZE and _DARWIN_UNLIMITED_SELECT is not defined." FD_SETSIZE is 1024, and it sounds like you have a ton of applications loading up, so perhaps it all filters down to waiting on more than 1024 FDs at one time.
For extra credit, see if the related (supposedly fixed) Java bug is in fact fixed on your machine. The bug report has pointers to test cases.
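If you want to confirm that theory before digging into native code, a sketch like the following (my suggestion, not something from the bug report; it relies on the com.sun.management.UnixOperatingSystemMXBean extension that Oracle/OpenJDK builds provide) will show how many descriptors the process is holding:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCount {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            // If this climbs past FD_SETSIZE (1024), any select()-backed
            // socket call in this JVM becomes a candidate for EINVAL.
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + " / max: " + unix.getMaxFileDescriptorCount());
        }
    }
}

Logging that periodically from inside the webapp should show whether the count crosses 1024 right around the 160-second mark where the startup failures begin.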
Thanks to Old Pro's answer, I confirmed that the select() FD_SETSIZE limitation is the cause. I located an existing bug for this limitation: https://bugs.openjdk.java.net/browse/JDK-8021820
The problem can be reproduced with the following code:
import java.io.*;
import java.net.*;

public class SelectTest {
    public static void main(String[] args) throws Exception {
        // Use up 1024 file descriptors. There'll already be some in use, obviously,
        // but this guarantees the problem will occur.
        for (int i = 0; i < 1024; i++) {
            new FileInputStream("/dev/null");
        }

        ServerSocket socket = new ServerSocket(8080);
        socket.accept();
    }
}
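One caveat when running this (based on typical OS X shell defaults rather than anything in the bug report): the per-process open file limit is often lower than 1024, so the FileInputStream loop can fail with "Too many open files" before the ServerSocket is even created. Raising the limit first, for example with ulimit -n 2048, lets the accept() call fail with the SocketException instead.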
Almost a year later, Java 7u60 has a fix for this problem:
http://www.oracle.com/technetwork/java/javase/2col/7u60-bugfixes-2202029.html
I also discovered that Tomcat's WebappClassLoader closes file handles after 90 seconds, which explains why setting breakpoints prevented the issue from occurring.