If you are running a validator or a public node, you should set up system monitoring so that you know when your node goes offline and, where possible, restart it automatically.
This tutorial explains how to set up monit and mmonit to accomplish this:
Monit is a process monitoring tool, which can restart your node if it stalls.
Mmonit is a dashboard that shows the performance (CPU, memory, alerts, etc.) of monit nodes.
(If you don't need a dashboard, you can skip directly to the section for setting up monit on an individual node.)
The tutorial assumes you are running Ubuntu 18.04.
Setting up an mmonit monitoring server
For the monitoring server, a small machine should be enough (e.g. 1 GB of memory). The default setup uses SQLite as its database, so all logged events will be lost if you remove the monit directory!
echo 'set httpd port 2812 and use address localhost'
echo ' allow localhost'
echo " allow $USER:$PASSWORD"
} > /etc/monit/monitrc
chmod 600 /etc/monit/monitrc
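Because the file written above contains the dashboard credentials, it is worth confirming that the permissions really are owner-only. A minimal sketch follows; the helper names file_mode and is_owner_only are hypothetical, and it assumes GNU stat as shipped with Ubuntu:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of monit.

# Print the octal permission bits of a file (GNU stat syntax).
file_mode() {
    stat -c '%a' "$1"
}

# Succeed only if the file is readable and writable by its owner alone.
is_owner_only() {
    [ "$(file_mode "$1")" = "600" ]
}

# Example:
#   is_owner_only /etc/monit/monitrc || echo "run: chmod 600 /etc/monit/monitrc"
```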
Refer to the monit documentation if you would like to set up email alerts or other additional checks on your node.
Monitoring API health
So far, monit only checks that the node does not remain at 100% CPU for extended periods. We can now add another check that uses the API to ensure the node is accepting connections and syncing properly.
This will catch some failure scenarios where the node appears to keep operating but stops recognizing new blocks. We provide nodeup for this purpose.
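To illustrate the kind of failure described here, the sketch below compares two block heights sampled some time apart and reports failure through its exit status, which is the signal a monit program check acts on. It is not nodeup's actual implementation; the function name height_advanced and the way the heights are obtained are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper, not part of nodeup.
# Usage: height_advanced PREVIOUS_HEIGHT CURRENT_HEIGHT
# Exit status: 0 = chain advanced, 1 = stalled, 2 = missing or non-numeric input.
height_advanced() {
    # Treat empty or non-numeric values as unhealthy, so that a failed
    # query is not mistaken for progress.
    for value in "$1" "$2"; do
        case "$value" in
            ''|*[!0-9]*) return 2 ;;
        esac
    done
    # Healthy only if the newer sample is strictly higher than the older one.
    [ "$2" -gt "$1" ]
}

# Example:
#   height_advanced 100 105   -> exit status 0 (new blocks were seen)
#   height_advanced 100 100   -> exit status 1 (node appears stalled)
```

Any nonzero status counts as a failure, so a node that is unreachable and a node that is stalled both end up triggering the same monit action.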
Clone nodeup into the /root directory. Follow its installation instructions:
Make sure your node is running with the --rpc-cors="*" flag, so WebSocket connections are accepted. Note that by default, this will not make the node accept connections from outside the local machine (--ws-external is required for that).
If it is working, you can now add these lines to your monit configuration at /etc/monit/monitrc (they assume that nodeup is installed in the /root/nodeup directory; adjust the path if it is not):
check program nodeup with path "/root/nodeup/index.js -u ws://localhost:9944"
  if status > 0 for 10 cycles then exec "/bin/systemctl stop edgeware" and repeat every 10 cycles
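How long "10 cycles" lasts depends on monit's poll interval, which is set by the "set daemon" line in monitrc. As a rough sketch, assuming a hypothetical interval of 60 seconds (substitute your own value), the delay before monit acts on a persistently failing check can be estimated like this:

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Usage: seconds_until_action POLL_INTERVAL_SECONDS CYCLES
seconds_until_action() {
    echo $(( $1 * $2 ))
}

# Assumed 60-second poll interval; use the value from your "set daemon" line.
seconds_until_action 60 10   # prints 600
```

With that assumed interval, the node would have to fail the check for about ten minutes before the exec action runs, and the action would repeat at the same spacing while the failure persists.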
Restart monit, and check that the new script is working: