Silence is Deadly: A Battle-Hardened Guide to Nagios & Munin Monitoring in 2011

There is no sound more terrifying to a systems administrator than silence. No emails, no tickets, no phone calls. Just a quiet void where your traffic used to be. Then, the storm hits. Your CEO is screaming because the webshop has been returning a 503 error since 4:00 AM, and you slept right through it.

If you are running production workloads without granular monitoring, you aren't an admin; you're a gambler. And the house always wins.

In the Norwegian hosting market, where latency to the NIX (Norwegian Internet Exchange) in Oslo is measured in single-digit milliseconds, downtime is unacceptable. Today, we are going to build a monitoring stack that actually works. We aren't talking about expensive enterprise bloatware. We are talking about the industry standards: Nagios Core 3 for alerting and Munin for trending.

The Dichotomy: Ambulance vs. MRI

A common mistake I see junior admins make is confusing the purpose of these two tools. They try to make Nagios graph things, or they rely on Munin to alert them.

  • Nagios is the ambulance. It screams when something is broken. "CPU load is above 10.0!" or "MySQL is unreachable!" It is binary. Up or Down. Critical or OK.
  • Munin is the MRI. It shows you the history. It tells you, "Your disk I/O wait has been creeping up by 2% every day for the last month." It helps you predict the crash before Nagios has to scream.

You need both. Running one without the other is flying blind.

Step 1: The Nagios Watchtower

We'll assume you are running a standard CentOS 5.6 or the new CentOS 6.0 stack. First, enable the EPEL repository (grab the matching epel-release-6 package instead if you are on CentOS 6). Don't compile from source unless you enjoy dependency hell.

rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
yum install nagios nagios-plugins-all nrpe
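
Once the packages are in, verify the config and bring the daemon up. The paths here assume the stock EPEL layout; adjust if you have moved nagios.cfg.

# Sanity-check the object definitions before the first start
nagios -v /etc/nagios/nagios.cfg
chkconfig nagios on
service nagios start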

The magic happens in /etc/nagios/objects/. You need to define your host. A trick I use for high-availability setups is to monitor the private interface for service health and the public interface for connectivity. This helps distinguish between a network partition and a server crash.
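
As a sketch of that pattern, register the box twice, once per interface. The host names, addresses, and the linux-server template below are placeholders from the stock sample configs; swap in your own.

define host{
    use                     linux-server
    host_name               web01.oslo.local
    alias                   web01 private interface (service checks)
    address                 192.168.1.21
}

define host{
    use                     linux-server
    host_name               web01-public
    alias                   web01 public interface (connectivity only)
    address                 203.0.113.21
    check_command           check-host-alive
}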

Here is a battle-tested service definition for checking MySQL health specifically for high-traffic Norwegian e-commerce sites:

define service{
    use                     generic-service
    host_name               web01.oslo.local
    service_description     MySQL_Connectivity
    check_command           check_mysql_cmdlinecredits!nagios!secretpassword
    notifications_enabled   1
    contact_groups          admins
}

Don't just check if port 3306 is open. Check if you can actually log in. I've seen plenty of servers where the port is open, but the mysqld process is zombie-locked.
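
For completeness, that check_command assumes a matching command object exists somewhere in your config. A minimal definition using the stock check_mysql plugin could look like this; the command name simply mirrors the one referenced in the service above, and it should point at a dedicated, low-privilege MySQL user rather than anything with real power.

define command{
    command_name    check_mysql_cmdlinecredits
    command_line    $USER1$/check_mysql -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$
}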

Step 2: Munin for Capacity Planning

Nagios wakes you up. Munin lets you sleep. By graphing your resources, you can see that you'll run out of inodes in three weeks. That gives you time to plan maintenance.
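
Munin's df_inode plugin graphs this for you, but if you want a spot check from the shell right now, plain df will do it:

# -i reports inode usage instead of block usage
df -i /var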

On your target node (the client):

yum install munin-node
chkconfig munin-node on

Edit /etc/munin/munin-node.conf. Security is paramount here. Only allow the IP of your monitoring server. The Datatilsynet (Data Inspectorate) takes a dim view of leaking server metrics, which can inadvertently reveal traffic patterns or user density.

allow ^127\.0\.0\.1$
# Allow our central monitoring server IP
allow ^192\.168\.1\.50$
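
With the node locked down, point the master at it. On the monitoring server, add a host tree entry to /etc/munin/munin.conf (the node IP here is a placeholder matching the examples above), then restart the node and confirm it answers on Munin's default port, 4949.

# /etc/munin/munin.conf on the monitoring server
[web01.oslo.local]
    address 192.168.1.21
    use_node_name yes

# Back on the node
service munin-node restart
telnet 192.168.1.21 4949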

The "I/O Wait" Trap

Here is a scenario from last week. A client moved from a dedicated server to a cheap VPS provider. Suddenly, their Munin graphs showed "CPU usage" spiking. They thought they needed more cores.

I looked more closely at the Munin graph. It wasn't user CPU; it was I/O wait. The physical disks on the host node were thrashing because the provider had oversold its spindles. Adding vCPUs fixes nothing when your bottleneck is a drive head seeking across a platter.

Pro Tip: Always check the diskstats plugin in Munin. If `iowait` consistently exceeds 10-15%, your storage backend is too slow for your application. No amount of caching configuration in `my.cnf` will save you from bad hardware.
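
To confirm from the shell what the diskstats graphs are showing, sysstat's iostat gives you the same numbers in real time:

yum install sysstat
# Extended device stats, 5-second intervals, 3 samples;
# watch %iowait under avg-cpu and await / %util per device
iostat -x 5 3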

The Hardware Reality

This brings us to infrastructure choice. You can have the best Nagios config in the world, but if your underlying host is unstable, you are just monitoring your own misery.

This is why we architect CoolVDS differently. We don't use container-based virtualization like OpenVZ for serious workloads, because "noisy neighbors" can steal your I/O. We use KVM (Kernel-based Virtual Machine) which is becoming the gold standard in RHEL 6. It provides true isolation.

Furthermore, we are beginning to roll out Solid State Drive (SSD) storage tiers. Unlike traditional 15k RPM SAS drives, SSDs virtually eliminate seek time. For a database-heavy application, the difference is night and day. SSD storage is still a premium technology in 2011, but for high-performance VPS hosting in Norway, it is the only logical path forward.

Compliance and Latency

Hosting in Norway isn't just about patriotism; it's about physics and law.
1. Latency: If your customers are in Oslo or Bergen, hosting in Germany or the US adds 30-100ms of round-trip time, and every TCP handshake pays that penalty before a single byte of content moves (a quick way to measure it yourself is shown after this list).
2. Privacy: Under the Personal Data Act (Personopplysningsloven), you are responsible for where your user data lives. Keeping it within Norwegian borders simplifies compliance with Datatilsynet audits.
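
Measuring the latency yourself takes thirty seconds; the hostname below is a placeholder for wherever you are currently hosted.

ping -c 10 www.example.no
# Per-hop round-trip breakdown
mtr --report --report-cycles 10 www.example.no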

Final Thoughts

Set up Nagios to catch failures. Set up Munin to prevent them. And ensure your underlying infrastructure isn't fighting against you.

If you are tired of debugging I/O wait on oversold hardware, it’s time to upgrade. CoolVDS offers managed hosting and unmanaged KVM instances with the new standard of high-speed storage. We keep the lights on so you don't have to.

Deploy a test instance on CoolVDS today and see what 2ms latency to NIX looks like.