Silence the Pager: Comprehensive Server Monitoring with Nagios and Munin in 2012

It’s 3:14 AM. Your phone buzzes. It’s not a text from a friend; it’s an automated SMS from your server farm. The database is down. Again.

If you manage infrastructure, you know this pain. It is the specific heartbreak of the systems administrator. The problem isn't usually that the server crashed; the problem is that you didn't see it coming. Reactive administration is a fast track to burnout. Proactive monitoring is how you keep your sanity—and your job.

In the Norwegian hosting market, where reliability is often mandated by strict SLAs and the Personopplysningsloven (Personal Data Act), flying blind is negligence. Today, we are going to build the holy grail of open-source monitoring: Nagios for immediate alerting and Munin for historical trending.

We will configure this on a standard CentOS 6.3 environment, typical of what you'd find on a high-performance CoolVDS instance.

The Philosophy: Alert vs. Trend

Many junior admins confuse the two. Here is the distinction:

  • Nagios is the watchdog. It asks: "Is the website up? Is the disk full?" If the answer is no, it screams at you.
  • Munin is the historian. It asks: "How fast was the disk filling up last Tuesday vs. today?"

You need both. Nagios wakes you up; Munin tells you what happened so you can fix it forever.

Part 1: The Watchdog (Nagios Core 3.4.1)

Nagios is the industry standard. It's ugly, the configuration files are arcane, and it is absolutely bulletproof. It doesn't need a fancy GUI to save your infrastructure.

Installation on CentOS 6

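First, enable the EPEL repository if you haven't already. Then install Nagios and the plugins, and enable and start both Nagios and Apache (which serves the web interface).

yum install nagios nagios-plugins-all
chkconfig nagios on
chkconfig httpd on
service nagios start
service httpd start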

Configuration Essentials

Nagios lives in /etc/nagios/. The most critical file is objects/contacts.cfg. If the contact definition is wrong, Nagios will still detect the outage, but the alert will never reach you.

define contact{
        contact_name                    sysadmin
        use                             generic-contact
        alias                           Battle Hardened Admin
        email                           alert@yourdomain.no
        service_notification_commands   notify-service-by-email
        host_notification_commands      notify-host-by-email
        }
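
A typo in any object file stops Nagios from starting, so it pays to validate before every restart. The sketch below wraps the built-in pre-flight check (nagios -v). The helper name nagios_config_ok is made up for this example, and the default paths are assumptions based on the EPEL package layout.

```shell
#!/bin/bash
# Sketch: validate the Nagios configuration before restarting the daemon.
# nagios_config_ok is a hypothetical helper name. The default paths are
# assumptions (EPEL layout); override NAGIOS_BIN / NAGIOS_CFG if yours differ.

nagios_config_ok() {
    local bin="${NAGIOS_BIN:-/usr/sbin/nagios}"
    local cfg="${NAGIOS_CFG:-/etc/nagios/nagios.cfg}"
    # "nagios -v <main config>" parses every object file and exits non-zero
    # when it finds errors.
    if "$bin" -v "$cfg" >/dev/null 2>&1; then
        echo "Config OK: safe to restart"
        return 0
    fi
    echo "Config has errors: run '$bin -v $cfg' to see them"
    return 1
}

# Typical use, as root:
#   nagios_config_ok && service nagios restart
```

Chaining the restart behind the check means a broken contacts.cfg never takes the watchdog itself offline.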

Pro Tip: Do not just monitor PING. PING is a liar. A server can respond to a ping while MySQL is deadlocked and Apache is serving 500 errors. Monitor the service ports.
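
To make the point concrete, here is a minimal sketch of a port probe written with nothing but bash. The function name check_tcp_port is hypothetical; in production you would normally use the stock check_tcp plugin (check_tcp -H host -p port) instead. Note that an open port only proves the daemon accepts connections, so pair it with a protocol-level check where one exists.

```shell
#!/bin/bash
# Sketch of a TCP probe that follows the Nagios exit-code convention.
# check_tcp_port is a hypothetical helper, not part of Nagios.

check_tcp_port() {
    local host="$1" port="$2" wait_secs="${3:-5}"
    # bash opens a TCP connection when /dev/tcp/HOST/PORT is used in a
    # redirection; timeout(1) from coreutils bounds the attempt.
    if timeout "$wait_secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
        echo "OK - $host:$port accepts connections"
        return 0
    fi
    echo "CRITICAL - $host:$port refused or timed out"
    return 2
}

# Example: probe MySQL on the local machine with a 3 second limit.
#   check_tcp_port 127.0.0.1 3306 3
```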

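Here is a basic definition for checking a web server, to be placed in objects/localhost.cfg. The sample localhost.cfg usually ships with an HTTP service that has notifications disabled; if yours does, edit that entry rather than adding a duplicate: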

define service{
        use                             local-service
        host_name                       localhost
        service_description             HTTP
        check_command                   check_http
        notifications_enabled           1
        }
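
Before trusting a service definition, it helps to run the plugin by hand and see what Nagios would see. The wrapper below is a sketch: run_check is a hypothetical name, and for simplicity it labels every exit status other than 0, 1, or 2 as UNKNOWN. The example path assumes the 64-bit EPEL plugin package.

```shell
#!/bin/bash
# Sketch: run any plugin command line and translate its exit status into a
# state name. run_check is a hypothetical helper, not part of Nagios.

run_check() {
    local output rc
    output=$("$@" 2>&1)
    rc=$?
    case "$rc" in
        0) echo "OK: $output" ;;
        1) echo "WARNING: $output" ;;
        2) echo "CRITICAL: $output" ;;
        *) echo "UNKNOWN: $output" ;;   # simplification: 3 and anything else
    esac
    return "$rc"
}

# Example: warn above 2 seconds response time, go critical above 5.
#   run_check /usr/lib64/nagios/plugins/check_http -H localhost -w 2 -c 5
```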

Part 2: The Historian (Munin 2.0)

Munin uses RRDtool to store and graph system metrics. It is lightweight and written in Perl, which is already installed on most Linux distributions.

Install the node (the agent) and the master (the grapher):

yum install munin munin-node
chkconfig munin-node on
service munin-node start
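
Munin's plugin contract is simple enough to sketch in a few lines: munin-node calls the plugin with the argument "config" to learn how to draw the graph, and with no argument to fetch the current value. Munin already ships its own load plugin, so the example below is purely illustrative. The function name loadavg_plugin and the LOADAVG_FILE override are assumptions made for this sketch.

```shell
#!/bin/bash
# Sketch of the Munin plugin contract, graphing the 1-minute load average.
# loadavg_plugin and LOADAVG_FILE are hypothetical names for this example.

LOADAVG_FILE="${LOADAVG_FILE:-/proc/loadavg}"

loadavg_plugin() {
    if [ "$1" = "config" ]; then
        # Graph metadata, requested once by munin-node.
        echo "graph_title Load average (1 minute)"
        echo "graph_vlabel load"
        echo "graph_category system"
        echo "load1.label load"
        return 0
    fi
    # Data fetch: the first field of /proc/loadavg is the 1-minute average.
    local load1 rest
    read -r load1 rest < "$LOADAVG_FILE" || return 1
    echo "load1.value $load1"
}

# To use it as a real plugin, end the script with:
#   loadavg_plugin "$1"
# then symlink it into /etc/munin/plugins/ and restart munin-node.
```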

The "War Story": The Case of the Slow Leak

Last year, I managed a Magento cluster for a retail client in Oslo. Every Friday at 4 PM, the site would slow to a crawl. Nagios never alerted because the site wasn't down, just slow. Latency spiked from 200ms to 5s.

I looked at the Munin graphs. Specifically, the MySQL InnoDB Buffer Pool graph. It showed a "sawtooth" pattern. The buffer pool was filling up exactly 48 hours after every restart, forcing MySQL to swap to disk.

We were using standard SATA drives at the time. The IOPS (Input/Output Operations Per Second) hit the ceiling, and the CPU went into iowait hell.

The Fix: We tuned innodb_buffer_pool_size in my.cnf to 70% of available RAM and migrated the database to a CoolVDS instance backed by Enterprise SSDs. The graph flattened instantly. Without Munin, we would have been guessing.
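
The 70% figure is easy to turn into a number for my.cnf. The sketch below does the arithmetic in whole megabytes; buffer_pool_mb is a hypothetical helper, and the percentage assumes a machine dedicated to the database, as in the story above.

```shell
#!/bin/bash
# Sketch: derive an innodb_buffer_pool_size value as 70% of total RAM.
# buffer_pool_mb is a hypothetical helper name.

buffer_pool_mb() {
    local total_kb="$1"
    # KB -> 70% -> MB, using integer arithmetic (rounds down).
    echo $(( total_kb * 70 / 100 / 1024 ))
}

# On a live server, MemTotal in /proc/meminfo is reported in KB:
#   total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   echo "innodb_buffer_pool_size = $(buffer_pool_mb "$total_kb")M"
```

For a server with 8 GB of RAM (8388608 KB) this works out to 5734M.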

Why Infrastructure Matters: The CoolVDS Advantage

Monitoring reveals the truth about your hosting provider. Install Munin on a cheap, oversold OpenVZ container, and look at the "CPU Steal" graph.

CPU Steal: The percentage of time your virtual CPU is ready to run but has to wait while the hypervisor serves another customer's virtual machine on the physical CPU.

On budget hosts, you will often see CPU steal spikes of 10-20%. That is performance you are paying for but not getting. It causes sporadic latency that is impossible to debug code-wise.
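
You can check steal without waiting for Munin to collect a day of data. The aggregate "cpu" line in /proc/stat lists cumulative ticks per state, with steal as the eighth counter. The sketch below compares two samples of that line; steal_percent is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: compute CPU steal between two samples of the aggregate "cpu" line
# from /proc/stat. steal_percent is a hypothetical helper name.

steal_percent() {
    # $1 and $2: two "cpu ..." lines taken a few seconds apart.
    # Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0; for (i = 2; i <= 9; i++) total += $i; steal = $9 }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            dt = total - t0
            if (dt > 0) printf "%.1f\n", 100 * (steal - s0) / dt
            else print "0.0"
        }'
}

# Live usage: sample, wait five seconds, sample again.
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_percent "$a" "$b"
```

A single five-second window is only a spot check; the Munin graph remains the better tool for catching spikes that come and go.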

At CoolVDS, we use KVM (Kernel-based Virtual Machine) virtualization. This provides stricter isolation. When we allocate a core to you, it's yours. Our Munin graphs for CPU Steal on client nodes consistently hover near 0.00%. Reliability isn't magic; it's physics and good architecture.

Writing a Custom Nagios Plugin

Sometimes the built-in plugins aren't enough. Maybe you need to check that a specific backup file exists and is at least 1GB. Here is a simple Bash script that adheres to the Nagios exit codes (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).

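#!/bin/bash
# /usr/lib64/nagios/plugins/check_backup.sh
# Checks that the daily backup exists and is at least 1GB.

FILE="/backup/daily.tar.gz"
MINSIZE=1048576 # 1GB in KB

# A missing backup is the worst case: report CRITICAL.
if [ ! -f "$FILE" ]; then
    echo "CRITICAL - Backup missing!"
    exit 2
fi

# Size on disk in KB (first column of the du output).
ACTUALSIZE=$(du -k "$FILE" | cut -f1)

# Quoting keeps the test valid even if a variable is empty or has spaces.
if [ "$ACTUALSIZE" -lt "$MINSIZE" ]; then
    echo "WARNING - Backup too small: ${ACTUALSIZE}k"
    exit 1
else
    echo "OK - Backup healthy: ${ACTUALSIZE}k"
    exit 0
fi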

Make it executable and add it to your commands.cfg. It’s that simple.
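
For reference, the registration step could look like the sketch below. The command name check_backup and the service description "Daily Backup" are illustrative choices, not anything Nagios mandates, and the sketch assumes $USER1$ in resource.cfg points at the plugin directory where the script was saved.

```shell
#!/bin/bash
# Sketch: print the object definitions that register check_backup.sh.
# The function names, command name, and service description are illustrative.

print_backup_command() {
    # The quoted heredoc delimiter keeps $USER1$ literal. $USER1$ is the
    # Nagios macro for the plugin directory, set in resource.cfg.
    cat <<'EOF'
define command{
        command_name    check_backup
        command_line    $USER1$/check_backup.sh
        }
EOF
}

print_backup_service() {
    cat <<'EOF'
define service{
        use                             local-service
        host_name                       localhost
        service_description             Daily Backup
        check_command                   check_backup
        }
EOF
}

# As root:
#   chmod +x /usr/lib64/nagios/plugins/check_backup.sh
#   print_backup_command >> /etc/nagios/objects/commands.cfg
#   print_backup_service >> /etc/nagios/objects/localhost.cfg
#   service nagios restart
```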

Data Sovereignty and The Norwegian Context

Hosting in 2012 requires attention to legal frameworks. Under the Data Protection Directive 95/46/EC and local Norwegian law, you are responsible for where your user data lives. Latency is also a factor. Routing traffic from Oslo to a server in Texas is inefficient. Routing it to a server in Oslo via NIX (Norwegian Internet Exchange) typically keeps round-trip times below 10 ms.

CoolVDS data centers are located locally. We keep your data within the jurisdiction of Datatilsynet, ensuring you sleep soundly not just because your servers are up, but because you are compliant.

Final Thoughts

Tools like Nagios and Munin are useless if the underlying hardware is unstable. You can monitor a dying horse, but you can't make it win a race.

If you are tired of seeing "CPU Steal" in your graphs or dealing with high I/O wait times, it is time to upgrade.

Deploy a KVM-based instance on CoolVDS today. Experience the silence of a well-monitored, high-performance server.