Zero Downtime: Mastering Munin and Nagios for High-Traffic Infrastructure
It’s 3:42 AM. Your phone buzzes. It’s not a text from a friend; it’s your server screaming. Your MySQL process just died, your load average is 50.0, and you have no idea why. If you are running production workloads without granular monitoring, you aren't a SysAdmin; you're a gambler.
In the hosting world—especially here in Norway where reliability is expected—uptime is the only currency that matters. I’ve seen too many developers deploy code to a "black box" server, only to panic when traffic spikes. Today, we are going to fix that. We are building a monitoring stack that tells you before the crash happens using the industry standards: Nagios for alerting and Munin for resource graphing.
The Stack: Why Nagios and Munin?
You might ask, "Why two tools?" Because they solve two different problems.
- Nagios is binary. Is the web server up? Yes/No. Is disk space under 90%? Yes/No. If "No," it pages you.
- Munin is historical. It answers: "Why did the server crash at 4:00 AM?" by showing you a graph of memory usage creeping up since Tuesday.
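In practice a Nagios check is just a program that prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below shows that contract for the disk example above. The function name check_disk_pct and the 80/90 thresholds are made up for illustration; the real check_disk plugin from nagios-plugins does far more.

```shell
# Hypothetical helper: classify a disk-usage percentage the way a Nagios plugin would.
# Usage: check_disk_pct USED [WARN] [CRIT]
check_disk_pct() {
    used=$1
    warn=${2:-80}   # assumed default warning threshold
    crit=${3:-90}   # assumed default critical threshold
    case $used in
        # Empty or non-numeric input maps to the UNKNOWN state.
        ''|*[!0-9]*) echo "DISK UNKNOWN - bad input '$used'"; return 3 ;;
    esac
    if [ "$used" -ge "$crit" ]; then
        echo "DISK CRITICAL - ${used}% used"; return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "DISK WARNING - ${used}% used"; return 1
    fi
    echo "DISK OK - ${used}% used"; return 0
}

# Example call against the root filesystem:
#   check_disk_pct "$(df -P / | awk 'NR==2 {sub("%","",$5); print $5}')"
```

Nagios reads only the exit code to decide whether to page you; the printed line is what shows up in the alert text.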
Step 1: The Setup (CentOS 5/6)
We assume you are running a clean VPS instance in Norway. I recommend CoolVDS for this because their upstream connectivity to NIX (Norwegian Internet Exchange) keeps network jitter from delaying your monitoring alerts. On a budget host with high latency, checks can time out and you will get false positives.
First, we need the EPEL repository, as the standard CentOS repos don't carry these tools. The package below is the CentOS 5 i386 release; on CentOS 6 or x86_64, use the epel-release package that matches your version and architecture.
rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
yum update -y
yum install nagios nagios-plugins-all munin munin-node httpd
Step 2: Configuring Nagios for Real Alerts
The default config is useless. You need to define who gets yelled at. Open the contacts configuration:
vi /etc/nagios/objects/contacts.cfg
Look for the email directive. Do not just put your work email here. If your mail server is on the same box that crashes, you won't get the alert. Use an external provider or an SMS gateway email alias.
define contact{
        contact_name    sysadmin
        use             generic-contact         ; inherit defaults from the generic-contact template
        alias           Battle Hardened Admin
        email           [email protected]       ; use an off-box address or SMS gateway alias
        }
Next, verify your configuration and start the service. Don't forget chkconfig or it won't survive a reboot.
nagios -v /etc/nagios/nagios.cfg
service nagios start
service httpd start
chkconfig nagios on
chkconfig httpd on
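Once Nagios is live, every later config change carries the risk of restarting into a broken config and silently losing alerting. One way to guard against that is a small wrapper that restarts only when verification passes. This is a sketch; reload_if_valid is a hypothetical helper name, not part of Nagios.

```shell
# Hypothetical helper: run a validation command, and only if it succeeds
# run the restart command.
#   $1: command string that validates the config
#   $2: command string that restarts the service
reload_if_valid() {
    if sh -c "$1"; then
        sh -c "$2"
    else
        echo "config check failed; not restarting" >&2
        return 1
    fi
}

# Typical call (as root):
#   reload_if_valid 'nagios -v /etc/nagios/nagios.cfg' 'service nagios restart'
```

Passing both commands as arguments keeps the wrapper reusable for other daemons that offer a config test, such as httpd with its configtest action.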
Step 3: Munin for the "Why"
Nagios tells you the house is on fire; Munin shows you who lit the match. The magic of Munin is the plugins. It auto-detects what you are running (Apache, MySQL, Exim) and generates graphs.
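Edit /etc/munin/munin-node.conf to allow your monitoring master to pull data. On a single box, the line allow ^127\.0\.0\.1$ is enough. If you have a cluster, add a second allow line with the master's IP, written as an anchored regular expression with the dots escaped. Then start the node with service munin-node start and run chkconfig munin-node on so it survives a reboot.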
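Because the allow directive takes a regular expression rather than a plain address, it is easy to forget the escaping when adding a master. A minimal sketch of generating the line; the helper name munin_allow_line and the 10.0.0.5 address are placeholders for illustration.

```shell
# Hypothetical helper: turn an IPv4 address into a munin-node "allow" line.
munin_allow_line() {
    # Escape each dot so it matches literally, then anchor both ends.
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf 'allow ^%s$\n' "$escaped"
}

# Typical use (as root); 10.0.0.5 stands in for your master's address:
#   munin_allow_line 10.0.0.5 >> /etc/munin/munin-node.conf
#   service munin-node restart
```

Without the escaping, an unescaped dot matches any character, so the pattern would accept more addresses than intended.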
Pro Tip: Pay close attention to the "Disk I/O" and "MySQL Slow Queries" graphs. On shared hosting, high I/O wait usually means a noisy neighbor. This is why we migrated our core databases to CoolVDS; their Xen-based isolation and high-performance RAID arrays mean 0% steal time, even during peak hours.
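If a graph you want is missing, the plugin is probably not linked into the node's active plugin directory. The sketch below assumes the usual packaged layout (/usr/share/munin/plugins for shipped plugins, /etc/munin/plugins for enabled ones); check your own install before relying on those paths. The function name enable_munin_plugin is hypothetical.

```shell
# Hypothetical helper: enable a Munin plugin by symlinking it into the active directory.
# Usage: enable_munin_plugin NAME [SOURCE_DIR] [ACTIVE_DIR]
enable_munin_plugin() {
    name=$1
    src=${2:-/usr/share/munin/plugins}   # assumed location of shipped plugins
    dst=${3:-/etc/munin/plugins}         # assumed location of enabled plugins
    if [ ! -e "$src/$name" ]; then
        echo "plugin not found: $src/$name" >&2
        return 1
    fi
    # -f makes the call safe to repeat.
    ln -sf "$src/$name" "$dst/$name"
}

# Typical use (as root):
#   enable_munin_plugin mysql_slowqueries
#   service munin-node restart
```

Running munin-node-configure --suggest first shows which plugins Munin thinks apply to the box.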
The "Steal Time" Trap
One metric most admins ignore is CPU Steal Time. In your Munin CPU graph, if you see a large "purple" area (steal), it means the hypervisor is starving your VM of cycles. This is common with oversold budget providers.
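You can cross-check the graph from the shell. The aggregate cpu line in /proc/stat holds cumulative counters (user, nice, system, idle, iowait, irq, softirq, steal), so a steal percentage needs two samples and a delta. A rough sketch; steal_pct is a made-up helper name and the 5-second interval is arbitrary.

```shell
# Hypothetical helper: percentage of CPU time stolen between two /proc/stat samples.
#   $1, $2: two "cpu ..." lines from /proc/stat, taken some seconds apart
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal (fields 2-9); steal is field 9.
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            steal = $9
        }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            dt = total - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (steal - s0) / dt
        }'
}

# Typical use:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_pct "$a" "$b"
```

A value that stays in the double digits across several samples points at the host, not at your application.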
If your monitoring shows high steal time, no amount of my.cnf tuning will save you. You need better hardware. CoolVDS guarantees dedicated CPU cycles, so your monitoring reflects your actual load, not the load of the teenager running a Minecraft server on the same node.
Compliance and Data Location
Working in the Nordic market, we face strict requirements under the Personal Data Act (Personopplysningsloven). While Safe Harbor exists for US transfers, the Norwegian Data Inspectorate (Datatilsynet) looks favorably on keeping data within national borders. Monitoring data often contains sensitive IP addresses and server names, so hosting it on servers physically located in Oslo simplifies your compliance posture significantly.
Final Thoughts
Monitoring is not optional. It is the difference between a minor hiccup and a business-ending outage. Set up Nagios today to catch the failures, and let Munin run for a week to establish your baseline.
If you are tired of debugging slow performance only to find out it's your host's slow disks, it's time to move. Deploy a test instance on CoolVDS; their low-latency network and solid I/O performance make them the reference platform for serious SysAdmins in 2011.