Systemd http health check

clausfod picture clausfod · Sep 24, 2016 · Viewed 7.7k times · Source

I have a service on a Redhat 7.1 which I use systemctl start, stop, restart and status to control. One time the systemctl status returned active, but the application "behind" the service responded http code different from 200.

I know that I can use Monit or Nagios to check this and do the systemctl restart - but I would like to know if there exist something per default when using systemd, so that I do not need to have other tools installed.

My preferred solution would be to have my service restarted if http return code is different from 200 totally automatically without other tools than systemd itself - (and maybe with a possibility to notify a Hipchat room or send a email...)

I've tried googling the topic - without luck. Please help :-)

Answer

Charles Duffy picture Charles Duffy · Sep 24, 2016

The Short Answer

systemd has a native (socket-based) healthcheck method, but it's not HTTP-based. You can write a shim that polls status over HTTP and forwards it to the native mechanism, however.


The Long Answer

The Right Thing in the systemd world is to use the sd_notify socket mechanism to inform the init system when your application is fully available. Use Type=notify for your service to enable this functionality.

You can write to this socket directly using the sd_notify() call, or you can inspect the NOTIFY_SOCKET environment variable to get the name and have your own code write READY=1 to that socket when the application is returning 200s.

If you want to put this off to a separate process that polls your process over HTTP and then writes to the socket, you can do that -- ensure that NotifyAccess is set appropriately (by default, only the main process of the service is allowed to write to the socket).


Inasmuch as you're interested in detecting cases where the application fails after it was fully initialized, and triggering a restart, the sd_notify socket is appropriate in this scenario as well:

Send WATCHDOG_USEC=... to set the amount of time which is permissible between successful tests, then WATCHDOG=1 whenever you have a successful self-test; whenever no successful test is seen for the configured period, your service will be restarted.