Enabling systemd watchdog for thin-edge services
Introduction
The systemd watchdog feature lets systemd detect when a service is unhealthy or unresponsive and recover it by restarting that service. To judge whether a service is healthy, systemd relies on health notifications from that service at regular intervals. If a notification does not arrive within the configured time threshold, systemd assumes the service is unhealthy and restarts it.
This document describes how the systemd watchdog mechanism can be enabled for thin-edge services.
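As a generic illustration of the mechanism, the sketch below shows how a self-notifying service could keep systemd informed. It is not how thin-edge services are wired, since tedge-watchdog sends the notifications on their behalf. It assumes the unit sets WatchdogSec (so that systemd exports WATCHDOG_USEC to the process) and that the unit's NotifyAccess setting permits notifications from the systemd-notify helper process; the function names are hypothetical. Notifying at half the timeout follows the recommendation in the sd_watchdog_enabled documentation.

```shell
#!/bin/sh
# Hypothetical helper: print half of the watchdog timeout, in whole seconds.
# systemd exports WATCHDOG_USEC (microseconds) to services that set WatchdogSec.
keepalive_interval_sec() {
    usec="${WATCHDOG_USEC:-0}"
    echo $(( $usec / 2000000 ))
}

# Sketch of a keep-alive loop; only meaningful when started by systemd.
keepalive_loop() {
    interval="$(keepalive_interval_sec)"
    # Watchdog not enabled (or timeout below 2 s): nothing to do in this sketch.
    [ "$interval" -gt 0 ] || return 0
    while :; do
        systemd-notify WATCHDOG=1   # tell systemd the service is still alive
        sleep "$interval"
    done
}
```

With WatchdogSec=30, WATCHDOG_USEC is 30000000 and the loop would notify every 15 seconds.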
Enabling the systemd watchdog feature for a tedge service
Enabling systemd watchdog for a thin-edge.io service (tedge-agent, tedge-mapper-c8y/az/collectd) is a two-step process.
Step 1: Enable the watchdog feature in the systemd service file
For example, to enable the watchdog feature for the tedge-mapper-c8y service, update the systemd service file as shown below:
The systemd service files for tedge services are usually located in the /lib/systemd/system directory, for example /lib/systemd/system/tedge-mapper-c8y.service.
- Add tedge-watchdog.service as an After service dependency under the [Unit] section.
- Add the watchdog interval as WatchdogSec=30 under the [Service] section.
- Set the restart condition to Restart=always under the [Service] section.
Here is the updated service file for the tedge-mapper-c8y service:
[Unit]
Description=tedge-mapper-c8y converts Thin Edge JSON measurements to Cumulocity JSON format.
After=syslog.target network.target mosquitto.service tedge-watchdog.service
[Service]
User=tedge-mapper
ExecStart=/usr/bin/tedge-mapper c8y
Restart=always
RestartPreventExitStatus=255
WatchdogSec=30
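The three changes above can be checked before reloading systemd. The following is a minimal sketch, assuming the unit file is edited in place at the path mentioned earlier; check_watchdog_unit is a hypothetical helper, not part of thin-edge.io.

```shell
#!/bin/sh
# Hypothetical helper: report which of the three settings described above are
# missing from a unit file. Returns 0 only if all of them are present.
check_watchdog_unit() {
    unit_file="$1"
    status=0
    grep -Eq '^After=.*tedge-watchdog\.service' "$unit_file" \
        || { echo "missing: After=... tedge-watchdog.service"; status=1; }
    grep -Eq '^WatchdogSec=[0-9]+' "$unit_file" \
        || { echo "missing: WatchdogSec=<seconds>"; status=1; }
    grep -q '^Restart=always' "$unit_file" \
        || { echo "missing: Restart=always"; status=1; }
    return "$status"
}

# Usage sketch (path taken from the example above):
# check_watchdog_unit /lib/systemd/system/tedge-mapper-c8y.service \
#     && sudo systemctl daemon-reload \
#     && sudo systemctl restart tedge-mapper-c8y
```

systemd only picks up an edited unit file after systemctl daemon-reload, which is why the usage sketch reloads before restarting the service.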
Step 2: Start the tedge-watchdog service
The tedge-watchdog service is responsible for periodically checking the health of all tedge services for which the watchdog feature is enabled, and for sending systemd watchdog notifications to systemd on their behalf.
Start and enable the tedge-watchdog service as follows:
sudo systemctl enable tedge-watchdog
sudo systemctl start tedge-watchdog
Once started, the tedge-watchdog service keeps checking the health of the monitored tedge services by periodically sending health check messages to them within their configured WatchdogSec interval.
The health check request for a service is published to the te/device/main/service/<service-name>/cmd/health/check topic, and the health status response from that service is expected on the te/device/main/service/<service-name>/status/health topic.
Once the health status response is received from a particular service, the tedge-watchdog service sends the watchdog notification to systemd on behalf of that monitored service.
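To make the topic scheme concrete, here is a small sketch that builds both topics for a given service name. The helper names are hypothetical; the topic layout is the one described above, with the device fixed to main.

```shell
#!/bin/sh
# Hypothetical helpers: build the MQTT topics used for one service's health check,
# following the topic scheme described above.
health_check_topic() {
    echo "te/device/main/service/$1/cmd/health/check"
}

health_status_topic() {
    echo "te/device/main/service/$1/status/health"
}

# Usage sketch: watch the health responses of a single monitored service.
# tedge mqtt sub "$(health_status_topic tedge-mapper-c8y)"
```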
If none of the thin-edge services has the watchdog feature enabled, the tedge-watchdog service stops and enters the inactive state.
To monitor any of the thin-edge services, update the corresponding systemd service file with WatchdogSec and then restart the tedge-watchdog service.
Debugging
One can observe the message exchange between the service and the watchdog by subscribing to the following topics:
Health status responses published by the services:
- tedge: tedge mqtt sub 'te/+/+/+/+/status/health'
- mosquitto: mosquitto_sub -t 'te/+/+/+/+/status/health'
- mqtt: te/+/+/+/+/status/health
Health check requests sent by the watchdog:
- tedge: tedge mqtt sub 'te/+/+/+/+/cmd/health/check'
- mosquitto: mosquitto_sub -t 'te/+/+/+/+/cmd/health/check'
- mqtt: te/+/+/+/+/cmd/health/check
See the Monitor tedge health page for more details about the health endpoint.
If the watchdog service does not send the notification to systemd within the WatchdogSec interval for a service, systemd restarts that service by killing the old process and spawning a new one to replace it.
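One way to confirm that such a restart was triggered by the watchdog is to look for timeout entries in the journal of the affected unit. The sketch below assumes that systemd logs these events with the phrase "Watchdog timeout"; the exact wording may differ between systemd versions, and count_watchdog_timeouts is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: count the lines on stdin that report a watchdog timeout.
# Assumes systemd logs such events with the phrase "Watchdog timeout".
count_watchdog_timeouts() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c 'Watchdog timeout' || true
}

# Usage sketch (requires permission to read the journal):
# journalctl -u tedge-mapper-c8y --no-pager | count_watchdog_timeouts
```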
Here is an example of using the systemd watchdog feature.