I've run into a really strange problem which I'm not sure how to debug further. I have an NGINX + PHP5-FPM + APC Amazon Ubuntu instance running a website built on a complex PHP framework. While trying to debug the problem, I've reduced the request flow to this: a lot of big classes get included, the main objects are created, the session is started, an array of configs is retrieved from memcached, an XML file is retrieved from memcached, HTML templates are included, and the output is sent to the client.
Then I use the http_load tool to put the website under a load of 20 requests per second:
http_load -timeout 10 -rate 20 -fetches 10000 ./urls.txt
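where urls.txt is just a list of the URLs to hit, one per line; the host and path below are placeholders, not the real ones:
http://<my-ec2-host>/usa/index.php?page=contact_us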
What happens next is rather strange: top shows a bunch of spawned php5-fpm processes, each taking a few percent of CPU, and everything runs smoothly, like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28440 www-data 20 0 67352 10m 5372 S 4.3 1.8 0:20.33 php5-fpm
28431 www-data 20 0 67608 10m 5304 S 3.3 1.8 0:16.77 php5-fpm
28444 www-data 20 0 67352 10m 5372 S 3.3 1.8 0:17.17 php5-fpm
28445 www-data 20 0 67352 10m 5372 S 3.0 1.8 0:16.83 php5-fpm
28422 www-data 20 0 67608 10m 5292 S 2.3 1.8 0:18.99 php5-fpm
28424 www-data 20 0 67352 10m 5368 S 2.0 1.8 0:16.59 php5-fpm
28438 www-data 20 0 67608 10m 5304 S 2.0 1.8 0:17.91 php5-fpm
28439 www-data 20 0 67608 10m 5304 S 2.0 1.8 0:23.34 php5-fpm
28423 www-data 20 0 67608 10m 5292 S 1.7 1.8 0:20.02 php5-fpm
28430 www-data 20 0 67608 10m 5300 S 1.7 1.8 0:15.77 php5-fpm
28433 www-data 20 0 67352 10m 5372 S 1.7 1.8 0:17.08 php5-fpm
28434 www-data 20 0 67608 10m 5292 S 1.7 1.8 0:18.56 php5-fpm
20648 memcache 20 0 51568 8192 708 S 1.3 1.3 2:51.06 memcached
28420 www-data 20 0 69876 13m 6300 S 1.3 2.3 0:20.89 php5-fpm
28421 www-data 20 0 67608 10m 5300 S 1.3 1.8 0:21.19 php5-fpm
28429 www-data 20 0 9524 2260 992 S 1.3 0.4 0:11.68 nginx
28435 www-data 20 0 67608 10m 5304 S 1.3 1.8 0:18.58 php5-fpm
28437 www-data 20 0 67352 10m 5372 S 1.3 1.8 0:17.87 php5-fpm
28441 www-data 20 0 67608 10m 5292 S 1.3 1.8 0:20.75 php5-fpm
Then, after some time which can be anywhere from one second to several minutes, several (usually two) php5-fpm processes suddenly consume all the CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28436 www-data 20 0 67608 10m 5304 R 48.5 1.8 0:23.68 php5-fpm
28548 www-data 20 0 67608 10m 5276 R 45.2 1.7 0:07.62 php5-fpm
28434 www-data 20 0 67608 10m 5292 R 2.0 1.8 0:23.28 php5-fpm
28439 www-data 20 0 67608 10m 5304 R 2.0 1.8 0:26.63 php5-fpm
At this point everything gets stuck and all new HTTP requests time out. If I stop the http_load tool, the php5-fpm processes will hang there for many minutes. Interestingly enough, if I do php5-fpm stop, the php5-fpm processes disappear, but any command that touches the filesystem has trouble executing. E.g. if I try to download a file via ssh, top shows the following, and it takes many minutes before the actual download starts:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3298 sshd 20 0 7032 876 416 R 75.2 0.1 0:04.52 sshd
3297 sshd 20 0 7032 876 416 R 24.9 0.1 0:04.49 sshd
The PHP-FPM error log usually has entries like this:
[05-Dec-2012 20:31:39] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 8 children, there are 0 idle, and 58 total children
[05-Dec-2012 20:32:08] WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers), spawning 16 children, there are 0 idle, and 66 total children
The nginx error log is flooded with entries like this:
2012/12/05 20:31:36 [error] 4800#0: *5559 connect() to unix:/dev/shm/php-fpm-www.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: ..., server: ec2-....compute-1.amazonaws.com, request: "GET /usa/index.php?page=contact_us HTTP/1.0", upstream: "fastcgi://unix:/dev/shm/php-fpm-www.sock:", host: "ec2-....compute-1.amazonaws.com"
The PHP-FPM slow log doesn't show anything interesting, swapping never happens, and I haven't managed to gather any other useful facts about the problem. I've gone through many iterations of config file changes; the most recent ones are:
nginx.conf: http://pastebin.com/uaD56hJF
pool.d/www.conf: http://pastebin.com/mFeeUULC
===UPDATE 1===
site's config: http://pastebin.com/qvinVNhB
===UPDATE 2===
I've also just found that dmesg reports errors like this:
[6483131.164331] php5-fpm[28687]: segfault at b6ec8ff4 ip b78c3c32 sp bff551f0 error 4 in ld-2.13.so[b78b5000+1c000]
===UPDATE 3===
We've got a new Amazon EC2 micro instance just in case, to exclude possible hardware issues. I am also using php-fastcgi now, to exclude possible FPM bugs. The other differences are minor; I think the only thing that changed is Ubuntu -> Debian. The same problem still happens, except that now the server manages to recover slightly after max_execution_time seconds (and then spikes again).
I tried playing with a separate test.php, and I am not sure whether it's the same issue, but at least in top it looks the same. I created a test.php that includes a bunch of libs belonging to our framework. The libs don't do anything except define classes or include other libs that define classes. I checked with APC and all of this gets served from its cache successfully. Then I started hitting test.php with 200 requests per second, and after some time the same thing happened, except that this time I managed to get some "too many open files" errors. It doesn't happen every time, though; sometimes it just starts timing out without printing the error, and a few PHP processes get stuck consuming all the CPU.
I've only played with it a bit, but I think there is a correlation here: by controlling the number of included libs or slightly varying the requests-per-second rate, I can control when the CPU spike happens. I increased the relevant OS limits, but the issue is still there, although it takes longer to show up (also note that I've set the limits to values N times larger than the total number of requests I make during the tests):
fs.file-max = 70000
...
* soft nofile 10000
* hard nofile 30000
...
worker_rlimit_nofile 10000;
...
(I reloaded all the configs and made sure the new system limits actually took effect.)
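For reference, the stripped-down test.php is essentially just a chain of includes, roughly like this (the file names here are placeholders, the real framework libs are different):

<?php
// test.php - only includes framework libs; each lib just defines
// classes or require_once-s further class files, nothing else runs
require_once __DIR__ . '/framework/lib/Loader.php';
require_once __DIR__ . '/framework/lib/Config.php';
require_once __DIR__ . '/framework/lib/Db.php';
// ... a few dozen more require_once calls ...
echo 'ok';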
So the best (and so far only) explanation I can come up with is that even though APC is supposed to serve files from memory, internally it is implemented in a way that still uses a file descriptor whenever a PHP include is executed. Either because it releases the descriptors with a delay, or because too many requests arrive at the same unfortunate moment, the system runs out of descriptors and newly arriving HTTP requests quickly pile up into a huge queue. I'll try to test this somehow.
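One rough way I can think of to check this is to make test.php report how many descriptors its worker currently holds, e.g. by counting the entries in /proc/self/fd. A Linux-specific sketch, appended to the end of test.php (not something I've verified yet):

// count the descriptors currently open in this php5-fpm/php-cgi worker
$fds  = @scandir('/proc/self/fd');
$open = ($fds === false) ? -1 : count($fds) - 2; // minus "." and ".."
error_log(sprintf('pid=%d open_fds=%d', getmypid(), $open));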
I've run a website with a similar configuration for many months with zero downtime. I've had a look at your config, and it looks OK. That being said, I did my config quite a while ago.
I would consider reducing pm.max_requests = 10000 to something more reasonable, like pm.max_requests = 500. This just means "don't use each instance for more than X requests". It is good not to set this number too high, because recycling workers regularly gives you resilience against possible PHP engine bugs.
I think the real problem is most likely in your PHP scripts. It's hard to say without knowing more.
EDIT:
Consider uncommenting ;request_terminate_timeout = 0 and setting it to something like request_terminate_timeout = 20. Your scripts will then be required to complete within 20 seconds. You will most likely see a change in behaviour, but I think your site might stay live. That would point to an error in one of your PHP scripts.
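In your pool config that would look like this (the value is just an example, tune it to your longest legitimate request):

; pool.d/www.conf
request_terminate_timeout = 20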
EDIT2: My own php-fpm config is as follows:
[example.com]
listen = /var/run/sockets/example.com.socket
user = www-data
group = www-data
pm = dynamic
pm.start_servers = 5
pm.max_children = 15
pm.min_spare_servers = 5
pm.max_spare_servers = 10
pm.max_requests = 500
php_flag[expose_php] = off
php_flag[short_open_tag] = on
EDIT3: I spotted something unexpected in your nginx config, but it may be nothing.
You are using fastcgi_ignore_client_abort on; this causes problems in worker processes under older versions of nginx. I haven't seen the problem myself, since I am running a custom build of a recent version. Here's the description of the problem from the nginx site:
In 1.0.2 POST requests are not handled correctly when fastcgi_ignore_client_abort is set to on which can lead to workers processes segfaulting. Switching fastcgi_ignore_client_abort back to default (off) should resolve this issue.
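So while testing, it may be worth switching it back to the default in whichever block of your nginx config currently sets it:

# default value
fastcgi_ignore_client_abort off;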