Apache Webserver, Tomcat, AJP, "All workers are in error state for route"

ronchalant · Jan 7, 2014 · Viewed 7.9k times

To preface this, I've been all over the internet attempting to find a solution. Below are just the latest links that have provided some information, none of which however seems to be working.

https://serverfault.com/questions/19947/apachetomcat-having-problems-communicating-unclear-error-messages-bringing-do

Tomcat stops responding to Apache

Despite many configuration changes which I outline below I have not been able to prevent the errors, which appear in the log thusly:

[Tue Jan 07 14:56:12.158345 2014] [proxy_ajp:error] [pid 12094:tid 140002805655296] (70007)The timeout specified has expired: AH01030: ajp_ilink_receive() can't receive header
[Tue Jan 07 14:56:12.158409 2014] [proxy_ajp:error] [pid 12094:tid 140002805655296] [client 10.4.65.146:58551] AH00992: ajp_read_header: ajp_ilink_receive failed, referer: http://xxxx/yyy/
[Tue Jan 07 14:56:12.158430 2014] [proxy_ajp:error] [pid 12094:tid 140002805655296] (70007)The timeout specified has expired: [client 10.4.65.146:58551] AH00878: read response failed from 10.4.3.33:8009 (tomcatworkerX), referer: http://xxxx/yyy/
[Tue Jan 07 14:56:12.229559 2014] [proxy_balancer:error] [pid 12094:tid 140002932012800] [client 10.4.230.138:57407] AH01167: balancer://lb: All workers are in error state for route (tomcatworkerX), referer: http://xxxx/yyy/zzz

Users that go down see the "Server Unavailable" screen, but the connection restores after a few minutes. Sometimes, though, the same worker connection goes up and down many times in a row; this could be due to the behavior of the users pinned to that worker (I use sticky sessions), but I haven't been able to confirm it.

My configuration is a single Apache Webserver instance running in a Windows environment, with 4 Tomcat workers configured via AJP. All Tomcat workers are currently hosted under Windows on separate hosts.

All hosts in my scenario are VMs in a robust production environment, with multiple cores devoted to each.

Apache Version:

Server version: Apache/2.2.22 (Win32)

Tomcat is version 7.0.29

Each BalancerMember has these configuration parameters:

keepalive=On timeout=600 ttl=600
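For context, a balancer stanza carrying those parameters might look like the sketch below. The worker names, second address, and stickysession cookie are assumptions for illustration; only the keepalive/timeout/ttl values come from the setup described here.

```apacheconf
<Proxy balancer://lb>
    # Hypothetical workers; route= must match each Tomcat's jvmRoute
    # for sticky sessions to hold
    BalancerMember ajp://10.4.3.33:8009 route=tomcatworker1 keepalive=On timeout=600 ttl=600
    BalancerMember ajp://10.4.3.34:8009 route=tomcatworker2 keepalive=On timeout=600 ttl=600
    ProxySet stickysession=JSESSIONID|jsessionid
</Proxy>
```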

Each Tomcat instance is currently using the native connector (org.apache.coyote.ajp.AjpAprProtocol).

Connector config:

<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" maxThreads="450" connectionTimeout="600000" />

The application itself connects to Oracle via the Oracle ojdbc15_g JDBC driver, v11.2.0.3.0.

Things that I have observed:

  1. The Tomcat servers do not appear to be getting overrun with requests from Apache. This comes from observing log activity and the Apache Webserver server-status data, bolstered by thread activity in jconsole (I never see the number of execution threads come anywhere near the maxThreads limit set above). This is an internal application servicing ~400 users, most of whom aren't on at the same time, so load shouldn't be the issue.
  2. I don't appear to have any thread deadlock issues. When monitoring the Tomcat instances remotely with jconsole, I look at the ajp-apr-8009-exec-# threads to confirm this; most are in a wait state, while some I can see actively processing.
  3. We DO have some long-running requests, some that at times exceed the 600s timeouts I outlined above. This is the area I'm exploring now. The cause of the long requests is usually a federated search against a very large data store that simply takes time, though it usually returns within seconds; when it takes longer, it's typically because a poorly constructed keyword search by the user causes Oracle to block for quite a while as it builds the results. I'm currently refactoring this so the search runs in a separate thread from the request/apr exec thread, and if it takes longer than 280s (4 minutes 40 seconds) to execute it will kill the thread and throw an error back to the user; this way I can rule out Tomcat taking too long to process a request.
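A minimal sketch of the refactor described in item 3, assuming plain java.util.concurrent (the class and method names here are hypothetical, not from the application): the search runs on its own thread, and the caller gives up after a fixed budget instead of holding the AJP exec thread past Apache's timeout.

```java
import java.util.concurrent.*;

// Hypothetical helper: run a slow search off the AJP exec thread
// and abandon it if it exceeds a time budget.
public class BoundedSearch {

    // Daemon threads so abandoned searches don't keep the JVM alive.
    private static final ExecutorService POOL =
            Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);
                return t;
            });

    // Runs `search` with a budget in seconds. On timeout, cancels the
    // task (interrupting its thread) and rethrows TimeoutException so
    // the caller can report an error to the user promptly.
    public static <T> T callWithTimeout(Callable<T> search, long budgetSeconds)
            throws Exception {
        Future<T> future = POOL.submit(search);
        try {
            return future.get(budgetSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            throw e;
        }
    }
}
```

Cancellation here only interrupts the thread; a JDBC call that ignores interruption would need `Statement.cancel()` or a driver-side timeout as well.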

As to what I'm doing now to address it: I feel I've exhausted my ability to configure my way out of this (which includes googling for every possible solution online, as I'm a software guy by trade, not infrastructure), so I'm trying a different tack by switching platforms. I've stood up an Apache Webserver on a Linux machine, and via DNS round-robin a portion of users get routed through Linux rather than Windows. This doesn't appear to have helped, but the Tomcat workers are still running on the same Windows boxes.

I'm currently getting the Tomcat app itself up on a Linux machine as well, and when I have that stable (some minor code changes are necessary due to assumptions about Windows being the only platform the app would be hosted on) I will add that as a worker to see if that particular instance encounters the same issues.

If nothing else, I'd like confirmation that my suspicion about the long-executing requests is the right path. I've tried various configuration changes to no avail.

Answer

user5100121 · Jul 9, 2015

That error appears in the Apache error_log.

We had an ELB with a timeout of 600 in front of Apache, and Tomcat was set to a timeout of 600.

Our error was the web server's timeout: if Apache's timeout is not explicitly configured, it defaults to 60 seconds.

Set, for example, TimeOut 600 in httpd.conf.

Otherwise the connection between the Apache webserver and the Tomcat instance can time out on long-running requests, a long API call for instance.
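In httpd.conf terms, the suggestion amounts to raising the global timeout (and, optionally, mod_proxy's own) to match the 600s used elsewhere in the chain; 600 is the value from this question, so adjust it for your stack:

```apacheconf
# Raise Apache's global I/O timeout to match the backend
TimeOut 600
# mod_proxy falls back to TimeOut unless ProxyTimeout is set explicitly
ProxyTimeout 600
```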