To preface this: I've been all over the internet attempting to find a solution. Below is just the latest of the links that provided some information, none of which has fixed the problem.
Tomcat stops responding to Apache
Despite many configuration changes, which I outline below, I have not been able to prevent the errors, which appear in the Apache error log as follows:
[Tue Jan 07 14:56:12.158345 2014] [proxy_ajp:error] [pid 12094:tid 140002805655296] (70007)The timeout specified has expired: AH01030: ajp_ilink_receive() can't receive header
[Tue Jan 07 14:56:12.158409 2014] [proxy_ajp:error] [pid 12094:tid 140002805655296] [client 10.4.65.146:58551] AH00992: ajp_read_header: ajp_ilink_receive failed, referer: http://xxxx/yyy/
[Tue Jan 07 14:56:12.158430 2014] [proxy_ajp:error] [pid 12094:tid 140002805655296] (70007)The timeout specified has expired: [client 10.4.65.146:58551] AH00878: read response failed from 10.4.3.33:8009 (tomcatworkerX), referer: http://xxxx/yyy/
[Tue Jan 07 14:56:12.229559 2014] [proxy_balancer:error] [pid 12094:tid 140002932012800] [client 10.4.230.138:57407] AH01167: balancer://lb: All workers are in error state for route (tomcatworkerX), referer: http://xxxx/yyy/zzz
Users caught by an outage see the "Server Unavailable" screen, but the connection recovers after a few minutes. Sometimes, however, the same worker goes up and down repeatedly; this could be driven by the behavior of the users pinned to it (I use sticky sessions), but I haven't been able to confirm that.
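One way to watch a worker flip into and out of error state in real time is mod_proxy_balancer's status page. This is a sketch for httpd 2.2 syntax; the location path is conventional, and the Allow range is a placeholder you would restrict to admin addresses:

<Location /balancer-manager>
  SetHandler balancer-manager
  Order deny,allow
  Deny from all
  Allow from 10.4
</Location>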
My configuration is a single Apache web server instance running in a Windows environment, with four Tomcat workers connected via AJP. All of the Tomcat workers are currently hosted under Windows, each on a separate host.
All hosts in my scenario are VMs in a robust production environment, with multiple cores devoted to each.
Apache Version:
Server version: Apache/2.2.22 (Win32)
Tomcat is version 7.0.29
Each BalancerMember has these configuration parameters:
keepalive=On timeout=600 ttl=600
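For context, the full balancer definition looks roughly like the sketch below. The 10.4.3.33 member and the tomcatworkerN route names mirror the error log above; the other member address, the /app mount point, and the cookie name are placeholders (JSESSIONID is Tomcat's default session cookie, which matters because I rely on sticky sessions):

<Proxy balancer://lb>
  BalancerMember ajp://10.4.3.33:8009 route=tomcatworker1 keepalive=On timeout=600 ttl=600
  BalancerMember ajp://10.4.3.34:8009 route=tomcatworker2 keepalive=On timeout=600 ttl=600
  # ...plus two more BalancerMember lines for the remaining workers
</Proxy>
ProxyPass /app balancer://lb/app stickysession=JSESSIONID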
Each Tomcat instance is currently using the native connector (org.apache.coyote.ajp.AjpAprProtocol).
Connector config (note that connectionTimeout is in milliseconds, so 600000 here matches the 600-second timeout on the Apache side):
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" maxThreads="450" connectionTimeout="600000" />
The application itself connects to Oracle via the Oracle ojdbc15_g JDBC driver, v11.2.0.3.0.
The main thing that I have observed: the disconnects seem to coincide with long-executing requests (more on this suspicion below).
As for what I am doing now to address it: I feel I've exhausted my ability to configure my way out of this (including googling every possible solution, as I'm a software guy by trade, not infrastructure), so I'm trying a different tack by switching platforms. I have stood up an Apache web server on a Linux machine, and via DNS round-robin a portion of users are now routed through Linux rather than Windows. This doesn't appear to have helped, but the Tomcat workers are still running on the same Windows boxes.
I'm also bringing the Tomcat app itself up on a Linux machine. Once I have that stable (some minor code changes are necessary, due to assumptions in the app that Windows would be its only host platform), I will add it as a worker to see whether that instance encounters the same issues.
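One low-risk way to try the new instance before it joins the pool is to mount it directly, bypassing the balancer; the worker host name and paths here are placeholders:

ProxyPass /linuxtest ajp://linuxworker01:8009/app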
If nothing else, I'd like confirmation that my suspicion about the long-executing requests is the right path; the configuration changes I've tried so far have been to no avail.
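One way to confirm (or rule out) the long-request theory is to log per-request service times on the Apache side; the %D directive in mod_log_config records the time taken to serve each request, in microseconds. The log file and format names here are just placeholders:

LogFormat "%h %l %u %t \"%r\" %>s %b %D" timing
CustomLog logs/access_timing.log timing

Requests whose %D values approach the 600-second budget would point squarely at the timeout chain.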
That error appears in the Apache error_log. We had an ELB with a timeout of 600 seconds in front of Apache, and Tomcat was set to a timeout of 600 seconds; our error turned out to be the web server's own timeout. If Apache is not explicitly configured, its timeout defaults to 60 seconds (300 on older 2.2 builds). Set TimeOut 600 in httpd.conf, for example.

The timeout between the Apache web server and the Tomcat instance can expire on long-running requests, a long API call for instance.
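To make that concrete, here is a minimal httpd.conf sketch aligning the front-end timeouts with the 600-second budget described above; ProxyTimeout is optional and falls back to TimeOut when unset:

# Global I/O timeout; unset, it defaults to far less than a long request needs
TimeOut 600
# Proxy-specific timeout; when unset, mod_proxy falls back to TimeOut
ProxyTimeout 600

The general rule: each layer in the chain (the ELB, Apache's TimeOut/ProxyTimeout, the per-worker timeout=600 on the BalancerMember, and Tomcat's connectionTimeout=600000 ms) should be at least as long as the layer behind it, otherwise the front gives up while the back end is still working.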