Tomcat startup fails due to 'java.net.SocketException Invalid argument' on Mac OS X

Danny Thomas · Apr 24, 2013 · Viewed 12.9k times

We have an application that runs on Tomcat 6 (6.0.35.0 to be precise), and most of our engineers on Mac OS are having problems starting Tomcat due to the socketAccept call in the Catalina.await method throwing a SocketException:

SEVERE: StandardServer.await: accept:
java.net.SocketException: Invalid argument
      at java.net.PlainSocketImpl.socketAccept(Native Method)
      at java.net.PlainSocketImpl.socketAccept(PlainSocketImpl.java)
      at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
      at java.net.ServerSocket.implAccept(ServerSocket.java:522)
      at java.net.ServerSocket.accept(ServerSocket.java:490)
      at org.apache.catalina.core.StandardServer.await(StandardServer.java:431)
      at org.apache.catalina.startup.Catalina.await(Catalina.java:676)
      at org.apache.catalina.startup.Catalina.start(Catalina.java:628)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:601)
      at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
      at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
      at mycompany.tomcat.startup.ThreadDumpWrapper.main(ThreadDumpWrapper.java:260)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:601)
      at org.tanukisoftware.wrapper.WrapperStartStopApp.run(WrapperStartStopApp.java:238)
      at java.lang.Thread.run(Thread.java:722)

This causes Tomcat to shut down immediately after startup (and no small amount of rage). We think this has been with us for as long as we've run Java 1.7 on Mac OS; in the last several months a lot of us have switched to MacBook Pros. Up until now, the only symptom was occasional zero-byte responses from Tomcat, due to this exception also being thrown on a socketRead. Those errors never hit the logs, and we'd individually shrugged them off as an isolated problem; we only found the cause when the startup failures began and I set a SocketException breakpoint:

Daemon Thread [http-8080-1] (Suspended (breakpoint at line 47 in SocketException))  
  SocketException.<init>(String) line: 47 
  SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method] 
  SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available  
  SocketInputStream.read(byte[], int, int, int) line: 150 
  SocketInputStream.read(byte[], int, int) line: 121  
  InternalInputBuffer.fill() line: 735  
  InternalInputBuffer.parseRequestLine() line: 366  
  Http11Processor.process(Socket) line: 814 
  Http11Protocol$Http11ConnectionHandler.process(Socket) line: 602  
  JIoEndpoint$Worker.run() line: 489  
  Thread.run() line: 722  

For arguments:

arg0  FileDescriptor  (id=499)  
  fd  1097  
  useCount  AtomicInteger  (id=503) 
    value 2 
arg1  (id=502)
arg2  0 
arg3  8192  
arg4  20000 

The problem is time-sensitive. Increasing startup time due to application changes (lots more Spring introspection/singleton overhead) seems to be the factor that causes this to affect Tomcat startup; the tipping point is about 160 seconds. We can mitigate the problem by disabling some of the non-mandatory contexts we don't need during development, which reduces startup time, but I'd prefer to find the root cause.

Application configuration

The specifics of the application are far too complex to go into too much detail, but I have a hunch that this might relate to an earlier bind, so I'll at least list the listening ports on my machine:

localhost:32000 - Java service wrapper port
*:10001         - RMI registry
*:2322          - Java debug
*:56566         - RMI
*:8180          - Tomcat HTTP connector
*:8543          - Tomcat HTTPS connector
*:2223          - Tomcat Internal HTTP connector (used for cross-server requests)
*:14131         - 'Locking' port to determine if an internal service is running
*:56571         - EhCache RMI
*:56573         - RMI
*:62616         - ActiveMQ broker
*:5001          - SOAPMonitorService
*:8109          - Tomcat shutdown port

Items ruled out

  • The most obvious solution: -Djava.net.preferIPv4Stack=true. I've always had that option configured
  • Any recent configuration change to our base application configuration, libraries, JVM options (there aren't any)
  • A JDK regression. I've tested JDK 1.7.0_09, 11, 15, 17 and 21 (the JDKs I've had installed on my machine for the duration)
  • Mac OS update. Mac OS 10.7.x and 10.8.0 through 10.8.3 are affected
  • File descriptor limits - increased from 5000 to 10000 (see the descriptor-count sketch just after this list)
  • Disabling IPv6 completely on the main ethernet interface
  • Setting breakpoints, and removing the first contexts to be affected by the SocketException (they're outgoing HTTP calls to web services). No change
  • Configuring /etc/hosts so the machine hostname resolves to localhost, and configuring JVM options to prefer IPv4 and to not prefer IPv6 addresses (This answer: https://stackoverflow.com/a/16318860/364206)
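
Since descriptor limits keep coming up, one cheap sanity check is to watch how many descriptors the Tomcat JVM actually has open as startup progresses. The sketch below is mine, not part of our application (the class name, the poll interval, and the idea of starting it from an early init hook such as a ServletContextListener are assumptions); it only relies on com.sun.management.UnixOperatingSystemMXBean, which JDK 7 should expose on Mac OS:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

import com.sun.management.UnixOperatingSystemMXBean;

// Hypothetical helper: call FdCountLogger.start() from any early init hook
// so the logging runs inside the same JVM as Tomcat.
public class FdCountLogger {
  public static void start() {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (!(os instanceof UnixOperatingSystemMXBean)) {
      return; // descriptor counts not exposed on this platform
    }
    final UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
    Thread t = new Thread(new Runnable() {
      public void run() {
        try {
          while (true) {
            // Log the current and maximum descriptor counts; during a slow
            // startup this shows whether the JVM is quietly holding open a
            // surprising number of descriptors.
            System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount()
                + " / max: " + unixOs.getMaxFileDescriptorCount());
            Thread.sleep(5000);
          }
        } catch (InterruptedException ignored) {
          // exit quietly on shutdown
        }
      }
    });
    t.setDaemon(true);
    t.start();
  }
}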

For those interested in hosts configuration, it's the same as default. I can reproduce this on a Fusion VM w/ a clean install of 10.8:

##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1   localhost
255.255.255.255 broadcasthost
::1             localhost
fe80::1%lo0 localhost

Java code investigation

Due to the apparently time-sensitive nature of the issue, setting breakpoints to troubleshoot it causes it not to occur. As requested in the comments, I also captured arg0 for SocksSocketImpl(PlainSocketImpl).socketAccept(SocketImpl); nothing seems out of the ordinary.

arg0  SocksSocketImpl  (id=460) 
  address InetAddress  (id=465) 
    canonicalHostName null  
    holder  InetAddress$InetAddressHolder  (id=475) 
      address 0 
      family  0 
      hostName  null  
  applicationSetProxy false 
  closePending  false 
  cmdIn null  
  cmdOut  null  
  cmdsock null  
  CONNECTION_NOT_RESET  0 
  CONNECTION_RESET  2 
  CONNECTION_RESET_PENDING  1 
  external_address  null  
  fd  FileDescriptor  (id=713)  
    fd  -1  
    useCount  AtomicInteger  (id=771) 
      value 0 
  fdLock  Object  (id=714)  
  fdUseCount  0 
  localport 0 
  port  0 
  resetLock Object  (id=716)  
  resetState  0 
  server  null  
  serverPort  1080  
  serverSocket  null  
  shut_rd false 
  shut_wr false 
  socket  Socket  (id=718)  
    bound false 
    closed  false 
    closeLock Object  (id=848)  
    connected false 
    created false 
    impl  null  
    oldImpl false 
    shutIn  false 
    shutOut false 
  socketInputStream null  
  stream  false 
  timeout 0 
  trafficClass  0 
  useV4 false 

I think all of the threads where the exceptions are thrown are victims of an earlier call, one that doesn't result in a SocketException, so I haven't been able to catch it. Being able to start Tomcat by reducing startup times convinces me that the trigger is probably some scheduled task that performs a socket-based operation, which then affects other socket operations.

That doesn't explain how and why this could affect several threads. Whatever we're doing to cause this condition, mysterious SocketExceptions shouldn't bubble up from native code simultaneously on multiple threads - that is, on two threads making outgoing web service calls, on the Tomcat await call, and repeatedly on several TP processor threads.

JNI code investigation

Given the generic message, I assumed that an EINVAL error must be returned from one of the system calls in the socketAccept JNI code, so I traced the system calls leading up to the exception; there's no EINVAL returned from any system call. So, I went to the OpenJDK sources looking for conditions in the socketAccept code that would set and then throw an EINVAL, but I also couldn't find any code that sets errno to EINVAL, or calls NET_ThrowByNameWithLastError, NET_ThrowCurrent or NET_ThrowNew in a way that would throw a SocketException with this default error message.

As for the system calls, we don't seem to get as far as the accept system call:

 PID/THRD        RELATIVE   ELAPSD    CPU SYSCALL(args)    = return
 6606/0x2c750d:  221538243       5      0 sigprocmask(0x1, 0x0, 0x14D8BE100)    = 0x0 0
 6606/0x2c750d:  221538244       3      0 sigaltstack(0x0, 0x14D8BE0F0, 0x0)     = 0 0
 6606/0x2c750d:  221538836      14     10 socket(0x2, 0x1, 0x0)    = 1170 0
 6606/0x2c750d:  221538837       3      0 fcntl(0x492, 0x3, 0x4)     = 2 0
 6606/0x2c750d:  221538839       3      1 fcntl(0x492, 0x4, 0x6)     = 0 0
 6606/0x2c750d:  221538842       5      2 setsockopt(0x492, 0xFFFF, 0x4)     = 0 0
 6606/0x2c750d:  221538852       7      4 bind(0x492, 0x14D8BE5D8, 0x10)     = 0 0
 6606/0x2c750d:  221538857       5      2 listen(0x492, 0x1, 0x4)    = 0 0
 6606/0x2c750d:  221539625       6      2 psynch_cvsignal(0x7FEFBFE00868, 0x10000000200, 0x100)    = 257 0
 6606/0x2c750d:  221539633       4      1 write(0x2, "Apr 18, 2013 11:05:35 AM org.apache.catalina.core.StandardServer await\nSEVERE: StandardServer.await: accept: \njava.net.SocketException: Invalid argument\n\tat java.net.PlainSocketImpl.socketAccept(Native Method)\n\tat java.net.PlainSocketImpl.socketAcce", 0x644)    = 1604 0

So, I think the problem occurs in the timeout handling code at the top of the accept loop in socketAccept, but I couldn't find any case where NET_Timeout would set errno to EINVAL and result in this SocketException being thrown. I'm referring to the socketAccept code in the jdk7u sources; I assume the jdk7u branch is for the most part what ships in the Oracle JDK.

Help!

I can't find anyone in the outside world affected by this particular problem on Mac OS, but almost everyone here is affected. There must be some application configuration that contributes, but I've exhausted every avenue I can think of to find the root cause.

Pointers on troubleshooting or insight on a possible cause would be much appreciated.

Answer

Old Pro · May 3, 2013

Have you tried turning on JNI debugging with -Xcheck:jni? Interestingly, the Oracle documentation uses a PlainSocketImpl.socketAccept error as an example of when to use this.

Note also that the implication of Bug 7131399 is that the JNI code uses poll() on most platforms but select() on Mac OS, due to a problem with poll() on the Mac. So maybe select() is broken too. Digging in a bit further, select() will return EINVAL if "ndfs is greater than FD_SETSIZE and _DARWIN_UNLIMITED_SELECT is not defined." FD_SETSIZE is 1024, and it sounds like you have a ton of applications loading up, so perhaps it all filters down to waiting on more than 1024 FDs at one time.

For extra credit, see if the related (supposedly fixed) Java bug is in fact fixed on your machine. The bug report has pointers to test cases.


Thanks to Old Pro's answer, I confirmed that the select() FD_SETSIZE limitation is the cause. I located an existing bug for this limitation:

https://bugs.openjdk.java.net/browse/JDK-8021820

The problem can be reproduced with the following code:

import java.io.*;
import java.net.*;
import java.util.*;

public class SelectTest {
  public static void main(String[] args) throws Exception {
    // Open (and hold on to) 1024 file descriptors. Some will already be in
    // use by the JVM, but this guarantees the socket below lands above
    // FD_SETSIZE, so the problem will occur.
    List<FileInputStream> hold = new ArrayList<FileInputStream>();
    for (int i = 0; i < 1024; i++) {
      hold.add(new FileInputStream("/dev/null"));
    }
    // The accept() on this socket then fails with the SocketException
    ServerSocket socket = new ServerSocket(8080);
    socket.accept();
  }
}
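
On an affected Mac this sketch should fail on (or around) the accept() call with the same 'java.net.SocketException: Invalid argument' seen in the Tomcat startup failure above; on platforms where the JNI code uses poll() rather than select(), it should presumably just block waiting for a connection.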

Almost a year later, Java 7u60 has a fix for this problem:

http://www.oracle.com/technetwork/java/javase/2col/7u60-bugfixes-2202029.html

I also discovered that Tomcat's WebappClassLoader closes file handles after 90 seconds, which explains why setting breakpoints prevented the issue from occurring.