Sleep Through the Night: Bulletproof Server Monitoring with Munin and Nagios on CentOS 6
It’s 3:42 AM on a Tuesday. Your phone buzzes on the nightstand. It’s not a text from a friend; it’s a frantic email from a client. Their Magento store is down. You grab your laptop, squinting at the screen, and SSH into the box. It’s sluggish. Running top shows the load average spiking, but why? Is it a brute force attack? A memory leak? Or did the backup script lock the database tables?
If you don't have historical data, you are just guessing. And guessing gets you fired.
In the world of high-availability hosting, silence is not golden—it’s terrifying. Unless you are monitoring your infrastructure, you aren't an administrator; you're a firefighter waiting for the arsonist. Today, we break down the classic, battle-tested duo for server omniscience: Munin for graphing trends and Nagios for alerting. We will deploy this on a CentOS 6 environment, the standard for enterprise stability right now.
The Philosophy: The Historian and The Watchdog
You need two distinct types of monitoring. Conflating them is a rookie mistake.
- The Historian (Munin): Munin paints pictures. It uses RRDTool to graph CPU, memory, IO, and network usage over days, weeks, and months. It answers the question: "When did the disk usage start climbing?"
- The Watchdog (Nagios): Nagios screams. It checks services (HTTP, SMTP, Disk Space) every few minutes. It answers the question: "Is the web server alive right now?"
On a VPS hosted in Norway, where latency to NIX (the Norwegian Internet Exchange) is measured in single-digit milliseconds, you need both views to tell whether a bottleneck lies in your application or in the network.
Step 1: Installing Munin on CentOS 6
First, we need the EPEL (Extra Packages for Enterprise Linux) repository. Munin isn't in the base CentOS repos.
# Install EPEL repo (epel-release is a noarch package, so the same RPM serves 32-bit and 64-bit installs)
rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-5.noarch.rpm
# Install Munin and the node
yum install munin munin-node
Once installed, we need to configure the node. The munin-node service runs on the server being monitored. If you are running a single CoolVDS instance, the server monitors itself. For a cluster, you have one master and many nodes.
Open /etc/munin/munin-node.conf:
# /etc/munin/munin-node.conf
log_level 4
log_file /var/log/munin/munin-node.log
pid_file /var/run/munin/munin-node.pid
background 1
setsid 1
user root
group root
# Allow localhost to connect
allow ^127\.0\.0\.1$
# If you have a separate monitoring server, add its IP here:
# allow ^192\.168\.1\.50$
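The allow lines are regular expressions, which is why every dot in the IP is escaped. If you add monitoring servers often, a small helper can generate the line for you. This is a minimal sketch; munin_allow_line is a hypothetical name invented for this example, not something Munin ships.

```shell
#!/bin/bash
# Hypothetical helper: turn a plain IPv4 address into a munin-node "allow" line.
# Dots are escaped so the regex matches that exact address and nothing else.
munin_allow_line() {
    local escaped
    escaped="$(printf '%s\n' "$1" | sed 's/\./\\./g')"
    printf 'allow ^%s$\n' "$escaped"
}

# Example (review the output before appending it to the config):
#   munin_allow_line 192.168.1.50 >> /etc/munin/munin-node.conf
```

Restart munin-node after changing the file so the new allow rule takes effect.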
Now, start the service and ensure it runs on boot. We aren't using upstart for everything yet, so good old init scripts apply.
/etc/init.d/munin-node start
chkconfig munin-node on
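You don't have to wait for graphs to know whether the node is alive. munin-node speaks a plain-text protocol, so you can ask it for a plugin's values directly. The sketch below assumes the default port 4949 and a bash built with /dev/tcp support; munin_fetch and munin_values are hypothetical helper names used only for this example.

```shell
#!/bin/bash
# Hypothetical helper: ask a munin-node for one plugin's current values.
# Runs in a subshell so the temporary file descriptor never leaks into your session.
munin_fetch() (
    host="$1"; plugin="$2"; port="${3:-4949}"   # 4949 is the default munin-node port
    exec 3<>"/dev/tcp/${host}/${port}" || exit 1
    printf 'fetch %s\nquit\n' "$plugin" >&3
    cat <&3
)

# Hypothetical helper: keep only the "field.value N" lines from a node reply,
# dropping the banner line and the "." terminator, and print them as field=N.
munin_values() {
    awk '$1 ~ /\.value$/ { sub(/\.value$/, "", $1); print $1 "=" $2 }'
}

# Example:
#   munin_fetch 127.0.0.1 load | munin_values
```

If the connection is refused, check that the service is running and that the connecting address matches one of the allow lines.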
After about 10 minutes, open http://your-server/munin/ in your browser (this assumes Apache is serving the default document root; the pages are generated into /var/www/html/munin). You should see graphs populating. If you see empty images, check permissions on that directory.
Pro Tip: Munin's default disk plugins can be I/O intensive, which hurts on lower-tier VPS platforms already suffering from "noisy neighbors." Because CoolVDS uses strict KVM isolation and high-speed storage, you can run the aggressive iostat_ios plugin without degrading your web server's performance.
Step 2: Configuring Nagios for Instant Alerts
Munin is great for post-mortem analysis, but Nagios wakes you up before the site dies. You can build Nagios 3 from source, but with EPEL already enabled the packaged version is the quickest route:
yum install nagios nagios-plugins-all
The magic happens in the object configuration. Let's define a check for our SSH service to ensure we haven't locked ourselves out. Edit /etc/nagios/objects/localhost.cfg:
define service{
use local-service
host_name localhost
service_description SSH
check_command check_ssh
notifications_enabled 1
}
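Before trusting a service definition, it helps to run the underlying plugin by hand. Nagios plugins report their state through the exit code: 0 is OK, 1 is WARNING, 2 is CRITICAL, and anything else is treated as UNKNOWN. The wrapper below is a small sketch of that convention; run_check is a hypothetical helper, and the plugin path in the example is an assumption about a 64-bit install (32-bit systems typically use /usr/lib/nagios/plugins).

```shell
#!/bin/bash
# Hypothetical helper: run any plugin command, then print the Nagios state
# name that corresponds to its exit code. The exit code is passed through.
run_check() {
    local rc=0
    "$@" || rc=$?
    case "$rc" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
    return "$rc"
}

# Example (path is an assumption, adjust it to your system):
#   run_check /usr/lib64/nagios/plugins/check_ssh localhost
```

The plugin's own output line appears first, followed by the state name, which makes it easy to see what Nagios would conclude from the same run.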
But the real killer is disk space. A full disk corrupts MySQL tables faster than you can say "restore from backup."
define service{
use local-service
host_name localhost
service_description Root Partition
check_command check_local_disk!20%!10%!/
}
This configures a warning when free space on / drops below 20% and a critical alert when it drops below 10%. Don't be the admin who ignores the yellow warning only to be hit by the red critical error during dinner.
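To make the threshold logic concrete, here is a rough shell sketch of the same decision. It is an illustration only, not how the check_disk plugin is implemented; disk_state and free_pct_of are hypothetical names, and the strict "less than" comparison at the boundary is an assumption of this example.

```shell
#!/bin/bash
# Hypothetical helper: classify a whole-number free-space percentage against
# warning and critical thresholds (defaults mirror the 20% / 10% used above).
# Return codes follow the Nagios convention: 0 OK, 1 WARNING, 2 CRITICAL.
disk_state() {
    local free_pct="$1" warn="${2:-20}" crit="${3:-10}"
    if [ "$free_pct" -lt "$crit" ]; then
        echo "CRITICAL"; return 2
    elif [ "$free_pct" -lt "$warn" ]; then
        echo "WARNING"; return 1
    fi
    echo "OK"
}

# Hypothetical helper: free-space percentage of a mount point, derived from
# the "Capacity" (used %) column of POSIX-format df output.
free_pct_of() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Example:
#   disk_state "$(free_pct_of /)"
```

Critical is tested first because any value below the critical threshold is also below the warning threshold.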
Verify your configuration before restarting:
nagios -v /etc/nagios/nagios.cfg
/etc/init.d/nagios restart
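Typed one after the other, those two commands will still restart Nagios even if the verification printed errors. A safer habit is to chain them so the restart only happens on a clean check. The sketch below wraps that in a function; nagios_safe_restart and the three NAGIOS_* variables are names introduced for this example, with defaults taken from the paths used above.

```shell
#!/bin/bash
# Overridable settings for this sketch (hypothetical variable names).
NAGIOS_BIN="${NAGIOS_BIN:-nagios}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"
NAGIOS_INIT="${NAGIOS_INIT:-/etc/init.d/nagios}"

# Restart Nagios only if the pre-flight configuration check succeeds.
nagios_safe_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        "$NAGIOS_INIT" restart
    else
        echo "Config check failed; Nagios left running with the old configuration." >&2
        return 1
    fi
}

# Example:
#   nagios_safe_restart
```

When the check fails, run nagios -v by hand to read the full error report, since the function discards that output.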