The Silence Before the Crash
It’s 3:17 AM. Your phone buzzes. It's not a text from a friend; it's an angry client asking why the webshop is returning a 502 Bad Gateway. You stumble to your laptop, SSH in, and find the load average is at 50.0. Apache is deadlocked. The database is crying.
If you had proper monitoring, you would have seen the warning signs three days ago.
In the world of systems administration, silence isn't golden—it's suspicious. Today, we are going to fix this using the two most reliable tools in the open-source arsenal: Nagios for immediate alerts and Munin for historical trending. Whether you are running a single VPS or a cluster of dedicated servers, this setup is non-negotiable.
The Philosophy: State vs. Trend
Many junior admins confuse the two. Here is the distinction:
- Nagios answers the question: "Is it broken right now?" It reports the current state (OK, WARNING, or CRITICAL) and nothing else. It wakes you up.
- Munin answers the question: "When did it start getting slow?" It draws graphs. It helps you diagnose the root cause.
You need both. Nagios tells you the disk is full; Munin shows you the graph of the disk filling up over the last week so you can catch it before it hits 100% next time.
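The "state" half is easy to see in how a Nagios plugin reports back: it prints one line of text and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The following is a minimal sketch of that convention, not a replacement for the stock check_disk plugin; the function name check_threshold and the thresholds in the usage comment are made up for illustration.

```shell
# check_threshold VALUE WARN CRIT
# Hypothetical helper: maps a disk-usage percentage onto the
# Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL).
check_threshold() {
    value="$1"
    warn="$2"
    crit="$3"
    if [ "$value" -ge "$crit" ]; then
        echo "DISK CRITICAL - ${value}% used"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "DISK WARNING - ${value}% used"
        return 1
    fi
    echo "DISK OK - ${value}% used"
    return 0
}

# Example call: warn at 80%, go critical at 90%.
#   check_threshold 95 80 90
```

Nagios only ever sees the latest exit code. The week-long curve that led up to it is what Munin keeps.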
Part 1: The Watchdog (Nagios 3 on CentOS 5)
Nagios 3 is the industry standard for a reason. It is ugly, complex to configure, and absolutely rock solid. While newer tools try to be flashy, Nagios just works.
First, install the necessary packages. I prefer using the EPEL repository for CentOS 5, as compiling from source is a waste of billable hours.
yum install nagios nagios-plugins-all nagios-plugins-nrpe
Configuring the Contacts
The biggest mistake I see is alerting the wrong people. Open /etc/nagios/objects/contacts.cfg. Do not send critical alerts to a generic 'info@' email that nobody checks until Monday morning.
define contact{
        contact_name                    sysadmin_on_call
        use                             generic-contact
        alias                           Battle Hardened Admin
        email                           [email protected]   ; address obscured in the source, substitute your on-call mailbox
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        }
Pro Tip: Use the check_nrpe plugin to execute checks locally on remote servers. Checking a port from the outside only tells you that the firewall is open and something is listening. Checking the process table locally tells you whether MySQL is actually running or is just a zombie process.
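As a rough sketch of what that looks like in practice, the fragment below wires up one local process check through NRPE. Several things here are assumptions rather than part of the setup above: the remote host runs the nrpe daemon (a separate EPEL package from the check_nrpe plugin), the plugins live in /usr/lib/nagios/plugins (on x86_64 it is usually /usr/lib64/nagios/plugins), the Nagios server is 192.168.1.10, a host object named db.coolvds.no already exists, and the command name check_mysqld is invented for this example. Your package may already ship a check_nrpe command definition, in which case skip that part.

```
# --- On the remote host: /etc/nagios/nrpe.cfg ---
# Only accept requests from localhost and the Nagios server (assumed IP).
allowed_hosts=127.0.0.1,192.168.1.10
# Critical if fewer than one process named mysqld is found.
command[check_mysqld]=/usr/lib/nagios/plugins/check_procs -c 1: -C mysqld

# --- On the Nagios server: e.g. /etc/nagios/objects/commands.cfg ---
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

# --- On the Nagios server: alongside the host definition ---
define service{
        use                     generic-service
        host_name               db.coolvds.no
        service_description     MySQL process
        check_command           check_nrpe!check_mysqld
        }
```

Note that this only counts processes named mysqld. Telling a healthy daemon from a defunct one would need an extra filter on process state, which is left out here.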
Part 2: The Historian (Munin)
Munin uses RRDTool (Round Robin Database) to store data. It is fantastic for spotting memory leaks or slow I/O degradation. Installing the node on your target server is straightforward:
yum install munin-node
chkconfig munin-node on
service munin-node start
Then, on your master server, add the node to /etc/munin/munin.conf:
[db.coolvds.no]
    address 10.0.0.5
    use_node_name yes
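One step that is easy to miss: out of the box, munin-node only answers connections from 127.0.0.1, so the master has to be allowed explicitly on each node. A sketch of the relevant lines in /etc/munin/munin-node.conf, assuming the master polls from 192.168.1.10 (the monitoring IP used in the firewall rules later in this article; substitute your own):

```
# /etc/munin/munin-node.conf on the node
# Each "allow" line is a regular expression matched against the
# connecting IP address, hence the anchors and escaped dots.
allow ^127\.0\.0\.1$
allow ^192\.168\.1\.10$
```

Restart the node with service munin-node restart after changing this file.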
The Hidden I/O Killer
Here is where cheap hosting providers fail you. Munin generates a lot of small write operations: every 5 minutes, it updates hundreds of .rrd files. On a standard VPS with oversold storage, those writes queue up behind everyone else's and the CPU sits waiting on the disk, which shows up as "iowait". Your monitoring tool ends up slowing down the very server it is supposed to watch.
We see this constantly with clients migrating to us. They try to run Munin on a budget VPS, and gaps start appearing in the graphs because the disk is too slow for the updates to finish on time.
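If you want a quick number for this on your own box, one option is to sample the aggregate "cpu" line of /proc/stat twice and work out what share of the elapsed CPU time went to iowait. The sketch below assumes the usual Linux field order after the "cpu" label (user, nice, system, idle, iowait, irq, softirq, steal); the function name iowait_pct is made up for this example.

```shell
# iowait_pct "<cpu line, first sample>" "<cpu line, second sample>"
# Prints the percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal (fields 2-9); later fields (guest time)
            # are skipped because they are already counted in user.
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6      # iowait is the 5th counter, i.e. field 6
        }
        END {
            d = t[2] - t[1]
            if (d <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / d
        }'
}

# Example usage on a live system (5-second window):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

Run it across a Munin update cycle. A figure that jumps every five minutes and drops back in between is a strong hint that the .rrd writes are what the disk is struggling with.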
The CoolVDS Advantage: Hardware Matters
Software configuration can only save you so much. If the underlying spindles are slow, your database locks up. At CoolVDS, we don't play the "overselling" game common in the budget market.
Our infrastructure uses enterprise-grade 15k RPM SAS RAID-10 arrays. Unlike standard SATA drives used by budget hosts, 15k SAS drives offer vastly superior random I/O performance. This means you can run intensive RRDTool updates for Munin alongside your high-traffic MySQL database without the disk queue spiking.
Norwegian Reliability
For our clients in Oslo and the greater Nordic region, latency is king. Hosting your monitoring server outside the country introduces network jitter that leads to false positives in Nagios. By placing your infrastructure in our Oslo datacenter, connected directly to NIX (Norwegian Internet Exchange), you ensure that an alert is a real problem, not just a hiccup in a trans-Atlantic fiber cable.
Furthermore, keeping your data within Norway makes it easier to comply with the Personal Data Act (Personopplysningsloven). Even server logs contain IP addresses, which Datatilsynet considers personal data. Don't risk it by hosting on a budget box in Texas.
Final Configuration Checks
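Before you close your SSH session, verify that the firewall on each node lets the monitoring server reach it. In iptables, allow ports 5666 (NRPE) and 4949 (Munin) only from your monitoring IP. On a stock CentOS 5 firewall, the lines below go in /etc/sysconfig/iptables, above the chain's final REJECT rule, followed by a service iptables restart: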
-A RH-Firewall-1-INPUT -s 192.168.1.10 -p tcp -m state --state NEW -m tcp --dport 5666 -j ACCEPT
-A RH-Firewall-1-INPUT -s 192.168.1.10 -p tcp -m state --state NEW -m tcp --dport 4949 -j ACCEPT
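If you manage more than a couple of nodes, typing these rules by hand gets error-prone. A small helper can print them for any monitoring IP so you can paste the output into /etc/sysconfig/iptables. This is only a convenience sketch: the function name monitor_rules is made up, and it assumes the stock RH-Firewall-1-INPUT chain used above.

```shell
# monitor_rules MONITORING_IP
# Prints one ACCEPT rule per monitoring port (5666 = NRPE, 4949 = Munin),
# restricted to the given source address. It only prints text and
# does not touch the running firewall.
monitor_rules() {
    ip="$1"
    for port in 5666 4949; do
        printf '%s\n' "-A RH-Firewall-1-INPUT -s $ip -p tcp -m state --state NEW -m tcp --dport $port -j ACCEPT"
    done
}

# Example:
#   monitor_rules 192.168.1.10
```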
Monitoring is the difference between a professional and an amateur. It gives you the confidence to deploy on a Friday (though we still don't recommend that).
Need a rock-solid foundation for your monitoring stack? Deploy a CoolVDS Xen instance today. With our 15k SAS storage and gigabit uplink to NIX, you’ll never miss a heartbeat.