Supervisor fails to restart half of the time

Paul K. picture Paul K. · Sep 23, 2015 · Viewed 12.9k times · Source

I'm trying to deploy a Django app using Uwsgi and supervisor on a machine running Debian 8.1.

When I restart via sudo systemctl restart supervisor it fails to restart half of the time.

$ root@host:/# systemctl start supervisor
    Job for supervisor.service failed. See 'systemctl status supervisor.service' and 'journalctl -xn' for details.
$ root@host:/# systemctl status supervisor.service
    ● supervisor.service - LSB: Start/stop supervisor
       Loaded: loaded (/etc/init.d/supervisor)
       Active: failed (Result: exit-code) since Wed 2015-09-23 11:12:01 UTC; 16s ago
      Process: 21505 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
      Process: 21511 ExecStart=/etc/init.d/supervisor start (code=exited, status=1/FAILURE)
    Sep 23 11:12:01 host supervisor[21511]: Starting supervisor:
    Sep 23 11:12:01 host systemd[1]: supervisor.service: control process exited, code=exited status=1
    Sep 23 11:12:01 host systemd[1]: Failed to start LSB: Start/stop supervisor.
    Sep 23 11:12:01 host systemd[1]: Unit supervisor.service entered failed state.

However there is nothing in the supervisor or uwsgi logs. Supervisor 3.0 is running with this configuration for uwsgi :

[program:uwsgi]
stopsignal=QUIT
command = uwsgi --ini uwsgi.ini
directory = /dir/
environment=ENVIRONMENT=STAGING
logfile-maxbytes = 300MB

stopsignal=QUIT has been added because UWSGI ignores the default signal (SIGTERM) on stop and gets killed brutally with SIGKILL leaving orphan workers.

Is there a way I could investigate what's happening ?

EDIT:

Tried as mnencia advised : /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start but it still fails half of the time.

 root@host:~# /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start
    [ ok ] Stopping supervisor (via systemctl): supervisor.service.
    ● supervisor.service - LSB: Start/stop supervisor
       Loaded: loaded (/etc/init.d/supervisor)
       Active: inactive (dead) since Tue 2015-11-24 13:04:32 UTC; 89ms ago
      Process: 23490 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
      Process: 23349 ExecStart=/etc/init.d/supervisor start (code=exited, status=0/SUCCESS)

    Nov 24 13:04:30 xxx supervisor[23349]: Starting supervisor: supervisord.
    Nov 24 13:04:30 xxx systemd[1]: Started LSB: Start/stop supervisor.
    Nov 24 13:04:32 xxx systemd[1]: Stopping LSB: Start/stop supervisor...
    Nov 24 13:04:32 xxx supervisor[23490]: Stopping supervisor: supervisord.
    Nov 24 13:04:32 xxx systemd[1]: Stopped LSB: Start/stop supervisor.
    [....] Starting supervisor (via systemctl): supervisor.serviceJob for supervisor.service failed. See 'systemctl status supervisor.service' and 'journalctl -xn' for details.
     failed!
    root@host:~# /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start
    [ ok ] Stopping supervisor (via systemctl): supervisor.service.
    ● supervisor.service - LSB: Start/stop supervisor
       Loaded: loaded (/etc/init.d/supervisor)
       Active: failed (Result: exit-code) since Tue 2015-11-24 13:04:32 UTC; 1s ago
      Process: 23490 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
      Process: 23526 ExecStart=/etc/init.d/supervisor start (code=exited, status=1/FAILURE)

Nov 24 13:04:32 xxx systemd[1]: supervisor.service: control process exited, code=exited status=1
Nov 24 13:04:32 xxx systemd[1]: Failed to start LSB: Start/stop supervisor.
Nov 24 13:04:32 xxx systemd[1]: Unit supervisor.service entered failed state.
Nov 24 13:04:32 xxx supervisor[23526]: Starting supervisor:
Nov 24 13:04:33 xxx systemd[1]: Stopped LSB: Start/stop supervisor.
[ ok ] Starting supervisor (via systemctl): supervisor.service.

Answer

mnencia picture mnencia · Nov 23, 2015

This is not necessarily an error from supervisor. I see from your systemctl status output that supervisor is started through the sysv-init compatibility layer, so the failure could be in the /etc/init.d/supervisor script. It would explain the absence of errors in the supervisord logs.

To debug the init script, the easiest way is to add a set -x as first non-comment instruction in that file, and look in the journalctl output the trace of the script execution.

EDIT:

I've reproduced and debugged it on a test system with Debian Sid.

The issue is that the stop target of the supervisor init-script does not check if the daemon has been really terminated but only send a signal if the process exists. If the daemon process takes a while to shutdown, the subsequent start action will fail due to the dying daemon process, which is counted as already running.

I've opened a bug on Debian Bug Tracker: http://bugs.debian.org/805920

WORKAROUND:

You can workaround the issue with:

/etc/init.d/supervisor force-stop && \
/etc/init.d/supervisor stop && \
/etc/init.d/supervisor start
  • force-stop will ensure the supervisord has been terminated (outside systemd).
  • stop make sure systemd know it's terminated
  • start starts it again

The stop after the force-stop is required otherwise systemd will ignore any subsequent start request. stop and start can be combined using restart, but here I've put both of them to show how it works.