I'm running CF 9.0.1 on Ubuntu on an "Medium" Amazon EC2 instance. CF has been seizing-up intermittently (several times per day...but notably not isolated to hours of peak usage). At such times, running top gets me this (or something similar):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+COMMAND
15855 wwwrun 20 0 1762m 730m 20m S 99.3 19.4 13:22.96 coldfusion9
So, it's obviously consuming most of the server resources. The following error has been showing up in my cfserver.log in the lead-up to each seize-up:
java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.
If I run /opt/coldfusion9/bin/coldfusion status, I get:
Pg/Sec DB/Sec CP/Sec Reqs Reqs Reqs AvgQ AvgReq AvgDB Bytes Bytes
Now Hi Now Hi Now Hi Q'ed Run'g TO'ed Time Time Time In/Sec Out/Sec
0 0 0 0 -1 -1 150 25 0 0 -1352560 0 0
In the administrator, under Server Settings > Request Tuning, the setting for Maximum number of simultaneous Template requests is 25. So this makes sense so far. I could just increase the thread pool to cover these sort of load spikes. I could make it 200. (Which I did just now as a test.)
However, there's also this file /opt/coldfusion9/runtime/servers/coldfusion/SERVER-INF/jrun.xml. And some of the settings in there appear to conflict. For example, it reads:
<service class="jrunx.scheduler.SchedulerService" name="SchedulerService">
<attribute name="bindToJNDI">true</attribute>
<attribute name="activeHandlerThreads">25</attribute>
<attribute name="maxHandlerThreads">1000</attribute>
<attribute name="minHandlerThreads">20</attribute>
<attribute name="threadWaitTimeout">180</attribute>
<attribute name="timeout">600</attribute>
</service>
Which a) has fewer active threads (what does this mean?), and b) has a max threads that exceed the simultaneous request limit set in the admin. So, I'm not sure. Are these independent configs that need to be made to match manually? Or is the jrun.xml file supposed to be written by the CF Admin when changes are made there? Hmm. But maybe this is different because presumably the CF Scheduler should only use a subset of all available threads, right?...so we'd always have some threads for real live users? We also have this in there:
<service class="jrun.servlet.http.WebService" name="WebService">
<attribute name="port">8500</attribute>
<attribute name="interface">*</attribute>
<attribute name="deactivated">true</attribute>
<attribute name="activeHandlerThreads">200</attribute>
<attribute name="minHandlerThreads">1</attribute>
<attribute name="maxHandlerThreads">1000</attribute>
<attribute name="mapCheck">0</attribute>
<attribute name="threadWaitTimeout">300</attribute>
<attribute name="backlog">500</attribute>
<attribute name="timeout">300</attribute>
</service>
This appears to have changed when I changed the CF Admin setting...maybe...but it's the activeHandlerThreads that matches my new maximum simulataneous requests setting...rather than the maxHandlerThreads, which again exceeds it. Finally, we have this:
<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
<attribute name="activeHandlerThreads">200</attribute>
<attribute name="minHandlerThreads">1</attribute>
<attribute name="maxHandlerThreads">1000</attribute>
<attribute name="mapCheck">0</attribute>
<attribute name="threadWaitTimeout">300</attribute>
<attribute name="backlog">500</attribute>
<attribute name="deactivated">false</attribute>
<attribute name="interface">*</attribute>
<attribute name="port">51800</attribute>
<attribute name="timeout">300</attribute>
<attribute name="cacheRealPath">true</attribute>
</service>
So, I'm not certain which (if any) of these I should change and what exactly the relationship is between maximum requests and maximum threads. Also, since several of these list the maxHandlerThreads as 1000, I'm wondering if I should just set the maximum simultaneous requests to 1000. There must be some upper limit that depends on available server resources...but I'm not sure what it is and I don't really want to play around with it since it's a production environment.
I'm not sure if it pertains to this issue at all, but when I run a ps aux | grep coldfusion I get the following:
wwwrun 15853 0.0 0.0 8704 760 pts/1 S 20:22 0:00 /opt/coldfusion9/runtime/bin/coldfusion9 -jar jrun.jar -autorestart -start coldfusion
wwwrun 15855 5.4 18.2 1678552 701932 pts/1 Sl 20:22 1:38 /opt/coldfusion9/runtime/bin/coldfusion9 -jar jrun.jar -start coldfusion
There are always these two and never more than these two processes. So there does not appear to be a one-to-one relationship between processes and threads. I recall from an MX 6.1 install I maintained for many years that additional CF processes were visible in the process list. It seemed to me at the time like I had a process for each thread...so either I was wrong or something is quite different in version 9 since it's reporting 25 running requests and only showing these two processes. If a single process can have multiple threads in the background, then I'm given to wonder why I have two processes instead of one?...just curious.
So, anyway, I've been experimenting while composing this post. As noted above I adjusted the maximum simultaneous requests up to 200. I was hoping this would solve my problem, but CF just crashed again (rather it slogged down and requests started timing out...so effectively "crashed"). This time, top looked similar (still consuming more than 99% of the CPU), but CF status looked different:
Pg/Sec DB/Sec CP/Sec Reqs Reqs Reqs AvgQ AvgReq AvgDB Bytes Bytes
Now Hi Now Hi Now Hi Q'ed Run'g TO'ed Time Time Time In/Sec Out/Sec
0 0 0 0 -1 -1 0 150 0 0 0 0 0 0
Obviously, since I'd increased the maximum simultaneous requests, it was allowing more requests to run simultaneously...but it was still maxing out the server resources.
Further experiments (after restarting CF) showed me that the server became unusably slogged after about 30-35 "Reqs Run'g", with all additional requests headed for an inevitable timeout:
Pg/Sec DB/Sec CP/Sec Reqs Reqs Reqs AvgQ AvgReq AvgDB Bytes Bytes
Now Hi Now Hi Now Hi Q'ed Run'g TO'ed Time Time Time In/Sec Out/Sec
0 0 0 0 -1 -1 0 33 0 0 -492 0 0 0
So, it's clear that increasing the maximum simultaneous requests has not helped. I guess what it comes down to is this: What is it having such a hard time with? Where are these spikes coming from? Bursts of traffic? On what pages? What requests are running at any given time? I guess I simply need more information to continue troubleshooting. If there are long-running requests, or other issues, I'm not seeing it in the logs (although I do have that option checked in the admin). I need to know which requests exactly are those responsible for these spikes. Any help would be much appreciated. Thanks.
~Day
I've had a number of 'high-cpu in production' type bugs and the way i've always dealt with them is this:
Use jstack PID >> stack.log to dump 5 of stack traces, 5 seconds apart. Number of traces and timing not critical.
Open the log in Samurai. You get a view of the threads at each time you did a dump. Threads processing your code start web- (for requests using the built-in server) and jrpp- for requests coming in through Apache/IIS.
Read the history of each thread. You're looking for the stack being very similar in each dump. If a thread looks like it's handling the same request the whole time, the bits that vary near the top will point to where an infinite loop is happening.
Feel free to dump a stack trace somewhere online and point us to it.
The other technique I've used to understand what's going on is to modify apache's httpd.conf to log time taken: %D and record session id: %{jsessionid} which allows you to trace individual users in the run-up to hangs and to do some nice stats/graphs with the data (I use LogParser to crunch the numbers and output to CSV, followed by Excel to graph the data):
LogFormat "%h %l %u %t "%r" %>s %b %D %{jsessionid}" customAnalysis
CustomLog logs/analysis_log customAnalysis
One other technique I've just remembered is to enable CF Metrics, which will get you some measure of what the server was up to in the runup to a hang. I set this to log every 10 seconds and change the format to be CSV, so I can grep the metrics from the event log and then run them through Excel to graph server load in the runup to crashes.
Barny