My company Nucleics has an array of servers distributed around the world to support our PeakTrace Basecaller. For historical reasons these servers are a mix of CentOS 6/7 VPS and physical servers supplied by three different companies. While the Auto PeakTrace RP application is designed to be robust in the face of server downtime, I wanted a dead simple monitoring service that would fix 99% of the server problems automatically and only contact me if there was something really wrong. After looking at the paid services I settled on using a combination of Monit and Pushover.
Monit is an open source watchdog utility that can monitor other Linux services and automatically restart them if they crash or stop working. The great thing about monit is that you can set it up to fix things on its own. For example, if the server can be fixed by simply restarting apache then I want the monitoring service to just do this and only send me a message if something major has happened. I also wanted a service that would ping my phone, but that I could easily control (i.e. turn on/off, set away times, etc.).
Pushover looked ideal for doing this. For a one-off cost of $5 you can use the Pushover API to send up to 7,500 messages a month to any phone. It has lots of other nice features like quiet times and group notification. It comes with a 7-day free trial so you have time to make sure everything is going to work with your system before paying.
The only issue with integrating monit and Pushover is that by default monit sends its alert notices by email. Most of our servers don’t have the ability to send email (they are slimmed down and only run the services needed to support PeakTrace). Luckily, monit can also execute scripts, so I settled on the alternative approach of calling the Pushover API via an alert script that passes through exactly which server and service is having problems. This alert script is only called if monit cannot fix the problem by restarting the service. After a bit of experimentation I got the whole system running rather nicely.
Here is the step-by-step guide. I did all this logged in as root, but if you don’t like to live on the edge just put sudo in front of every command.
Setting up Pushover
After registering an account at Pushover and downloading the appropriate app for your phone (iOS or Android), you need to set up a new application on the Pushover website.
Click on Register an Application/Create an API Token. This will open the Create New Application/Plugin page.
- Give the application a name (I called it Monit), but you can call it anything you like.
- Choose “script” as the type.
- Add a description (I called it Monit Server Monitoring).
- Leave the url field blank.
- If you want you can add an icon, but you don’t need to do this. It is nice though having an icon when you get a message.
- Press the Create Application button.
You need to record the new application API Token/Key as well as your Pushover User Key (you can find this on the main pushover page if you are logged in). You will need both these keys to have monit be able to ping Pushover via the alert script.
Install the EPEL package repository.
# yum install -y epel-release
Install monit and curl.
# yum install -y monit curl
Set monit to start on boot and start monit.
# chkconfig monit on && service monit start
You can edit the monit.conf file in /etc but the default values are fine. Take a look at the monit man page for more details about what you might want to change.
Create the Pushover Alert Script
You need to create the script that monit will call when it raises an alert.
# nano /usr/local/bin/pushover.sh
Paste the following text substituting your own API Token and User Keys before saving.
#!/bin/bash
# Send the monit alert (host, service, description) to the Pushover API.
/usr/bin/curl -s --form-string "token=API Token" \
  --form-string "user=User Key" \
  --form-string "message=[$MONIT_HOST] $MONIT_SERVICE - $MONIT_DESCRIPTION" \
  https://api.pushover.net/1/messages.json
Make the script executable.
# chmod 700 /usr/local/bin/pushover.sh
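The script as written does not look at what the Pushover API sends back. Here is a minimal sketch of a variant that does, assuming the reply is a JSON body containing "status":1 on success. The variable names PUSHOVER_TOKEN and PUSHOVER_USER and both function names are placeholders I made up for illustration; they are not part of monit or Pushover.

```shell
#!/bin/bash
# Return success (0) if a Pushover API response body reports status 1.
# This is a plain substring match, not a real JSON parse.
pushover_ok() {
    case "$1" in
        *'"status":1'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Send one message and report whether the API accepted it.
# PUSHOVER_TOKEN and PUSHOVER_USER are hypothetical variable names
# holding your API Token and User Key.
send_alert() {
    local response
    response=$(/usr/bin/curl -s --max-time 10 \
        --form-string "token=${PUSHOVER_TOKEN}" \
        --form-string "user=${PUSHOVER_USER}" \
        --form-string "message=$1" \
        https://api.pushover.net/1/messages.json) || return 1
    pushover_ok "$response"
}
```

Calling send_alert "some text" would then return non-zero if curl fails or the reply does not report success, which you could log locally if you wanted a record of missed alerts.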
Test that the script works by running it directly. If there are no issues the script will return without error and you will get a short message in the Pushover phone app almost immediately.
Once you have the pushover.sh alert script set up you need to create all the service-specific monit .conf files. You can mix and match these to suit the services you are running on your server. The aim is to have monit restart the service if there are any issues and, only if this does not solve the problem, call the pushover.sh alert script. This way most servers will fix themselves and you only get contacted if something catastrophic has happened.
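When you run the script by hand the three MONIT_* variables are empty, because monit only sets them when it runs an exec action. A small sketch of how the message is assembled, with made-up fallback values so you can see what will arrive on your phone (the function name build_message and the defaults are assumptions for illustration only):

```shell
#!/bin/bash
# Build the same message string that pushover.sh sends.
# MONIT_HOST, MONIT_SERVICE and MONIT_DESCRIPTION are set by monit when it
# runs an exec action; the fallbacks are placeholders for manual testing.
build_message() {
    local host="${MONIT_HOST:-test-host}"
    local service="${MONIT_SERVICE:-test-service}"
    local description="${MONIT_DESCRIPTION:-manual test}"
    printf '[%s] %s - %s\n' "$host" "$service" "$description"
}

# Show what the message looks like in the current environment.
build_message
```

To exercise the real script with realistic values you can set the variables on the command line, for example MONIT_HOST=$(hostname) MONIT_SERVICE=test MONIT_DESCRIPTION="manual test" /usr/local/bin/pushover.sh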
# nano /etc/monit.d/system.conf
check system $HOST
  if loadavg (5min) > 4 then exec "/usr/local/bin/pushover.sh"
  if loadavg (15min) > 2 then exec "/usr/local/bin/pushover.sh"
  if memory usage > 80% for 4 cycles then exec "/usr/local/bin/pushover.sh"
  if swap usage > 20% for 4 cycles then exec "/usr/local/bin/pushover.sh"
  if cpu usage (user) > 90% for 4 cycles then exec "/usr/local/bin/pushover.sh"
  if cpu usage (system) > 80% for 4 cycles then exec "/usr/local/bin/pushover.sh"
  if cpu usage (wait) > 80% for 4 cycles then exec "/usr/local/bin/pushover.sh"
  if cpu usage > 200% for 4 cycles then exec "/usr/local/bin/pushover.sh"

# nano /etc/monit.d/apache.conf
check process httpd with pidfile /var/run/httpd/httpd.pid
  start program = "/etc/init.d/httpd start" with timeout 60 seconds
  stop program = "/etc/init.d/httpd stop"
  if children > 250 then restart
  if loadavg(5min) greater than 10 for 8 cycles then exec "/usr/local/bin/pushover.sh"
  if failed port 80 for 2 cycles then restart
  if 3 restarts within 5 cycles then exec "/usr/local/bin/pushover.sh"

# nano /etc/monit.d/sshd.conf
check process sshd with pidfile /var/run/sshd.pid
  start program "/etc/init.d/sshd start"
  stop program "/etc/init.d/sshd stop"
  if failed port 22 protocol ssh then restart
  if 5 restarts within 5 cycles then exec "/usr/local/bin/pushover.sh"

# nano /etc/monit.d/fail2ban.conf
check process fail2ban with pidfile /var/run/fail2ban/fail2ban.pid
  start program "/etc/init.d/fail2ban start"
  stop program "/etc/init.d/fail2ban stop"
  if 5 restarts within 5 cycles then exec "/usr/local/bin/pushover.sh"

# nano /etc/monit.d/syslog.conf
check process rsyslog with pidfile /var/run/syslogd.pid
  start program "/etc/init.d/rsyslog start"
  stop program "/etc/init.d/rsyslog stop"
  if 5 restarts within 5 cycles then exec "/usr/local/bin/pushover.sh"

# nano /etc/monit.d/crond.conf
check process crond with pidfile /var/run/crond.pid
  start program "/etc/init.d/crond start"
  stop program "/etc/init.d/crond stop"
  if 5 restarts within 5 cycles then exec "/usr/local/bin/pushover.sh"

# nano /etc/monit.d/mysql.conf
check process mysqld with pidfile /var/run/mysqld/mysqld.pid
  start program = "/etc/init.d/mysqld start"
  stop program = "/etc/init.d/mysqld stop"
  if failed host 127.0.0.1 port 3306 then restart
  if 5 restarts within 5 cycles then exec "/usr/local/bin/pushover.sh"
Check that all the .conf files are correct.
# monit -t
If everything is fine then start monitoring by loading the new .conf files.
# monit reload
Check the status of monit by using
# monit status
This should give you something like this depending on which services you are monitoring.
The Monit daemon 5.14 uptime: 3d 20h 17m

System 'rps.peaktraces.com'
  status                            Running
  monitoring status                 Monitored
  load average                      [0.00] [0.12] [0.11]
  cpu                               0.2%us 0.1%sy 0.0%wa
  memory usage                      106.6 MB [10.7%]
  swap usage                        0 B [0.0%]
  data collected                    Tue, 19 Jul 2016 04:16:06

Process 'rsyslog'
  status                            Running
  monitoring status                 Monitored
  pid                               1016
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            4d 23h 33m
  children                          0
  memory                            3.4 MB
  memory total                      3.4 MB
  memory percent                    0.3%
  memory percent total              0.3%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Tue, 19 Jul 2016 04:16:06

Process 'sshd'
  status                            Running
  monitoring status                 Monitored
  pid                               1176
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            4d 23h 33m
  children                          4
  memory                            1.2 MB
  memory total                      20.7 MB
  memory percent                    0.1%
  memory percent total              2.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  port response time                0.006s to [localhost]:22 type TCP/IP protocol SSH
  data collected                    Tue, 19 Jul 2016 04:16:06

Process 'fail2ban'
  status                            Running
  monitoring status                 Monitored
  pid                               1304
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            4d 23h 33m
  children                          0
  memory                            30.2 MB
  memory total                      30.2 MB
  memory percent                    3.0%
  memory percent total              3.0%
  cpu percent                       0.1%
  cpu percent total                 0.1%
  data collected                    Tue, 19 Jul 2016 04:16:06

Process 'crond'
  status                            Running
  monitoring status                 Monitored
  pid                               1291
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            4d 23h 33m
  children                          0
  memory                            1.2 MB
  memory total                      1.2 MB
  memory percent                    0.1%
  memory percent total              0.1%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Tue, 19 Jul 2016 04:16:06

Process 'httpd'
  status                            Running
  monitoring status                 Monitored
  pid                               20963
  parent pid                        1
  uid                               0
  effective uid                     0
  gid                               0
  uptime                            4h 5m
  children                          2
  memory                            7.7 MB
  memory total                      19.0 MB
  memory percent                    0.7%
  memory percent total              1.9%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  data collected                    Tue, 19 Jul 2016 04:16:06
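With more than a handful of services the full status listing gets long. A rough sketch of a filter that prints only the entries whose status is not Running, assuming the output layout shown above (a Process or System line followed by a status line). The function name services_not_running is my own invention, and the status wording may differ in other monit versions, so treat this as a starting point rather than a finished tool.

```shell
#!/bin/bash
# Read `monit status` output on stdin and print every System/Process entry
# whose "status" line is anything other than "Running".
# Assumes the monit 5.14 layout shown above; other versions may differ.
services_not_running() {
    awk '
        # Remember the most recent service name (kept with its quotes).
        $1 == "Process" || $1 == "System" { name = $2 }
        # "monitoring status" lines start with "monitoring", so they are skipped.
        $1 == "status" && $2 != "Running" {
            sub(/^[ \t]*status[ \t]+/, "")
            print name ": " $0
        }
    '
}
```

You would use it as monit status | services_not_running, which prints nothing when everything is running.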
You may want to adjust the system.conf values if your server is under sustained high loads so as to scale back on the pushover triggers. Since you will know exactly what the trigger is, this is quite easy to do.
To create a monit .conf file for a new service you just need to make sure that you use the correct .pid file path for the service and that the start and stop paths are correct. These can be a little non-obvious (look at syslog.conf for example). If you do make a mistake, monit -t and monit status will show you what is wrong.
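Since most of the service files follow the same four-line pattern, here is a small sketch of a helper that prints a stanza for a new service. The function name monit_stanza is made up, and the ntpd pid file and init script paths in the example call are assumptions you would need to check on your own server.

```shell
#!/bin/bash
# Print a monit stanza in the same pattern as the service files in this post.
# Arguments: service name, pid file path, init script path.
monit_stanza() {
    local name="$1" pidfile="$2" initscript="$3"
    printf 'check process %s with pidfile %s\n' "$name" "$pidfile"
    printf '  start program "%s start"\n' "$initscript"
    printf '  stop program "%s stop"\n' "$initscript"
    printf '  if 5 restarts within 5 cycles then exec "/usr/local/bin/pushover.sh"\n'
}

# Example call; the ntpd paths are assumed, not verified.
monit_stanza ntpd /var/run/ntpd.pid /etc/init.d/ntpd
```

Redirect the output into a file under /etc/monit.d/ and then run monit -t to check it before reloading.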
Once you have all this in place then sit back, relax and let the servers take care of themselves (well we can all dream).
Edit July 2017. I have been using this system for over a year now and it has been working great. I have had no problem that monit has not fixed by itself by just restarting the service. About the only issue I have had is load spikes on the server caused by a runaway service not monitored.
I have recently used the same approach to monitor for unauthorised logins, which I wrote up in Dead simple ssh login monitoring with Monit and Pushover.