Stop Guessing: Architecting Bulletproof Server Monitoring with Nagios & Munin on CentOS 6

I have a simple rule for every junior sysadmin that joins my team: If it isn't monitored, it doesn't exist.

We recently took over a cluster for a large e-commerce client in Oslo. They were complaining about "random" downtime. The previous provider—a generic budget host—blamed the application. I logged in, installed htop, and saw the steal time hovering at 40%. The hypervisor was oversold. But worse? They had zero visibility. No historical graphs, no alert thresholds, nothing. They were flying blind in a blizzard.

In the Nordic hosting market, where we rely on the stability of NIX (Norwegian Internet Exchange) and expect sub-millisecond latency within Oslo, guessing is professional suicide. Today, we are going to build a monitoring stack that actually works. We aren't using heavy, expensive enterprise suites. We are using the industry standards: Nagios Core 3.4 for alerting and Munin for trending.

The Architecture: Centralized Eyes, Distributed Agents

Don't run your monitoring on the same server you are monitoring. That is like putting the fire alarm inside the safe. You need a dedicated VPS. For this setup, a small instance on CoolVDS is perfect because we need reliability, not raw power. The architecture looks like this:

  • The Watchtower (Monitor Host): Runs Nagios Core and Munin Master.
  • The Nodes (Client Hosts): Run NRPE (Nagios Remote Plugin Executor) and Munin-Node.

Step 1: The Watchtower Setup (CentOS 6.3)

First, we need the EPEL repository. The standard CentOS repositories are too conservative: they do not ship Nagios, NRPE, or Munin.

    rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
    yum install nagios nagios-plugins-all nrpe munin httpd php -y

The core of Nagios is the configuration. Most people give up here because the config files are pedantic. Good. Rigor saves uptime. We need to define a generic template for our Linux servers to avoid repeating ourselves.

Edit /etc/nagios/objects/templates.cfg:

    define host{
        name                    linux-server
        use                     generic-host
        check_period            24x7
        check_interval          5
        retry_interval          1
        max_check_attempts      10
        check_command           check-host-alive
        notification_period     24x7
        notification_interval   120
        notification_options    d,u,r
        contact_groups          admins
        register                0
        }
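With the template in place, each monitored machine needs only a short host definition that inherits from it. A minimal sketch, assuming a client called web01 at 192.168.1.60 (both hypothetical) and a file in a directory that nagios.cfg loads through cfg_dir (the path below is an assumption; check your own nagios.cfg):

```cfg
# /etc/nagios/conf.d/web01.cfg (hypothetical file name and location)
define host{
        use         linux-server     ; inherit every setting from the template
        host_name   web01            ; hypothetical name
        alias       Web node 01
        address     192.168.1.60     ; hypothetical client IP
        }
```

Run nagios -v /etc/nagios/nagios.cfg before every restart; it parses the whole configuration and reports errors without touching the running daemon. The stock templates.cfg usually already contains a linux-server template, so adjust that one in place rather than adding a second definition with the same name.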

Step 2: The "Battle-Hardened" Check Script

Standard plugins check disk space and load. That is kid stuff. We need to check for disk I/O bottlenecks and MySQL replication lag. If you are running a high-traffic site on a VPS, I/O wait is your enemy.

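Here is a custom Bash script for the client nodes to check disk latency. It reads await from iostat: the average time, in milliseconds, that I/O requests (reads and writes) spend queued and being serviced. Save this as /usr/lib64/nagios/plugins/check_disk_latency.sh: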

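    #!/bin/bash
    # Check average wait time (await) for I/O requests on one device
    # Typical thresholds: warning at 50ms, critical at 100ms
    DEVICE=$1
    WARN=$2
    CRIT=$3

    if [ -z "$DEVICE" ] || [ -z "$WARN" ] || [ -z "$CRIT" ]; then
        echo "Usage: $0 device warn crit"
        exit 3
    fi

    # Get await from the second iostat sample (requires the sysstat package).
    # Column 10 is await on CentOS 6 (sysstat 9.0.x); newer sysstat releases
    # add r_await/w_await columns, which shifts the column positions.
    # LC_ALL=C keeps the decimal separator a dot so cut can drop the fraction.
    AWAIT=$(LC_ALL=C iostat -x -d "$DEVICE" 1 2 | grep -w "$DEVICE" | tail -1 | awk '{print $10}' | cut -d. -f1)

    # Empty or non-numeric output means iostat failed or the device is unknown
    if ! [[ "$AWAIT" =~ ^[0-9]+$ ]]; then
        echo "UNKNOWN - could not read await for $DEVICE"
        exit 3
    fi

    if [ "$AWAIT" -ge "$CRIT" ]; then
        echo "CRITICAL - Disk Latency is ${AWAIT}ms | latency=${AWAIT}ms;$WARN;$CRIT;;"
        exit 2
    elif [ "$AWAIT" -ge "$WARN" ]; then
        echo "WARNING - Disk Latency is ${AWAIT}ms | latency=${AWAIT}ms;$WARN;$CRIT;;"
        exit 1
    else
        echo "OK - Disk Latency is ${AWAIT}ms | latency=${AWAIT}ms;$WARN;$CRIT;;"
        exit 0
    fi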

Don't forget to chmod +x it. This script relies on sysstat, which you should have installed anyway.
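Before wiring the plugin into NRPE, it helps to exercise the threshold logic on its own. The sketch below isolates the comparison in a function so it can be tried with made-up values and without iostat. The name classify_latency is hypothetical and not part of any Nagios package:

```shell
#!/bin/bash
# Sketch: the plugin's threshold comparison as a standalone function.
# classify_latency is a hypothetical helper, not part of any Nagios package.
# Return codes follow the Nagios plugin convention:
#   0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
classify_latency() {
    local await=$1 warn=$2 crit=$3 v
    # Anything that is not a non-negative integer is reported as UNKNOWN
    for v in "$await" "$warn" "$crit"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "UNKNOWN - expected integers: await warn crit"
                return 3 ;;
        esac
    done
    if [ "$await" -ge "$crit" ]; then
        echo "CRITICAL - Disk Latency is ${await}ms"
        return 2
    elif [ "$await" -ge "$warn" ]; then
        echo "WARNING - Disk Latency is ${await}ms"
        return 1
    fi
    echo "OK - Disk Latency is ${await}ms"
    return 0
}

# Try the boundaries with made-up values (50ms warning, 100ms critical)
for sample in 12 50 100; do
    classify_latency "$sample" 50 100
    echo "exit code: $?"
done
```

Nagios decides the service state from the exit code alone; the text is only what shows up in the web interface and the alert email. A value equal to a threshold already trips it, because the comparison is "greater than or equal".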

Pro Tip: On CoolVDS instances, we utilize high-performance SSD arrays (RAID-10). You will rarely see this latency check spike unless you are pushing massive sequential writes. However, on legacy mechanical drives or oversold hosts, this alert will save your life before the database locks up.

Step 3: Wiring NRPE

On the client node (the server you want to watch), install NRPE:

    yum install nrpe nagios-plugins-all -y
    chkconfig nrpe on

You must configure /etc/nagios/nrpe.cfg to allow the Watchtower IP. Security first—don't open this to the world.

    # Replace 192.168.1.50 with your Monitoring VPS IP.
    # Keep comments on their own lines, not after a value.
    allowed_hosts=127.0.0.1,192.168.1.50

    command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
    command[check_root_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
    command[check_custom_latency]=/usr/lib64/nagios/plugins/check_disk_latency.sh vda 50 100

Restart NRPE: service nrpe restart.
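The client side is only half of the wiring: the Watchtower still has to call these commands through the check_nrpe plugin. A sketch of the Nagios side follows. It assumes a host named web01 (hypothetical; use whatever host_name you defined), the stock generic-service template, and check_nrpe installed in the plugin directory (on EPEL it is usually packaged separately as nagios-plugins-nrpe):

```cfg
# On the Watchtower, e.g. in /etc/nagios/objects/commands.cfg
define command{
        command_name    check_nrpe
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }

# One service per NRPE command; web01 is a hypothetical host_name
define service{
        use                     generic-service
        host_name               web01
        service_description     Disk Latency
        check_command           check_nrpe!check_custom_latency
        }
```

The argument after the exclamation mark must match the name inside command[...] in nrpe.cfg on the client. To test the path by hand from the Watchtower, run /usr/lib64/nagios/plugins/check_nrpe -H followed by the client IP and -c check_custom_latency; you should get the same one-line output the script prints locally.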

Why Munin? (Because RRDTool never lies)

Nagios tells you if it's broken now. Munin tells you why it broke an hour ago. Munin graphs trends.

Configuring the Munin node is trivial. Edit /etc/munin/munin-node.conf:

    # Regex for the master IP
    allow ^192\.168\.1\.50$
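The node only answers; the master has to be told whom to poll. A sketch of the matching entry on the Watchtower, assuming a node called web01.example.com at 192.168.1.60 (both hypothetical):

```cfg
# /etc/munin/munin.conf on the Watchtower
[web01.example.com]
    address 192.168.1.60
    use_node_name yes
```

Restart the node with service munin-node restart after changing the allow line. The master normally runs from cron every five minutes, so expect a short delay before the first graphs appear.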

The real magic of Munin is identifying "noisy neighbors" or resource leaks. If you see your memory usage "sawtooth" every 20 minutes, you have a cron job misbehaving. If you see a gradual incline over a week, you have a memory leak in Java or PHP-FPM.

Handling the Local Context: Data Privacy

Operating in Norway means adhering to the Personal Data Act (Personopplysningsloven). When you configure logging for Nagios or Munin, ensure you aren't logging PII (Personally Identifiable Information) in the alert outputs. For example, if you monitor a log file for 500 errors, strip the IP addresses before sending the alert email.
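One way to do that stripping is a small filter between the log and the mailer. A minimal sketch, assuming IPv4 addresses only; redact_ips is a hypothetical helper name, and IPv6 addresses would need their own pattern:

```shell
#!/bin/bash
# Sketch: replace anything shaped like an IPv4 address with a placeholder.
# redact_ips is a hypothetical helper name; IPv6 is not handled here.
# The pattern is a POSIX basic regex: 1-3 digits, then three ".digits" groups.
redact_ips() {
    sed 's/[0-9]\{1,3\}\(\.[0-9]\{1,3\}\)\{3\}/[redacted-ip]/g'
}

# Example: a made-up access log line passed through the filter
echo '192.0.2.44 - - "GET /checkout HTTP/1.1" 500 1234' | redact_ips
```

The pattern is deliberately loose: it also redacts strings that merely look like an address, such as a four-part version number. For alert emails that trade-off errs on the safe side.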

This is where hosting local matters. Keeping your monitoring data and your production data within Norwegian borders (or at least the EEA) simplifies compliance with Datatilsynet requirements. CoolVDS infrastructure is physically located here, ensuring your data governance strategy remains intact.

The Hardware Reality Check

You can script monitoring all day, but software cannot fix bad physics. In 2013, we are seeing a shift. Spinning rust (HDD) is dead for database hosting. If your check_disk_latency.sh script is constantly firing, no amount of MySQL tuning in my.cnf will fix it.

We are seeing the emergence of enterprise SSDs and early PCIe flash storage solutions. While expensive, the ROI on not waking up at 3:00 AM is infinite. We built the CoolVDS platform on KVM virtualization specifically to give you direct access to these hardware benefits without the abstraction tax of older containers like Virtuozzo.

Troubleshooting Connection Issues

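If Nagios says CHECK_NRPE: Error - Could not complete SSL handshake, the Watchtower reached the NRPE daemon but was turned away: check first that its IP is listed in allowed_hosts and that NRPE was restarted after the change. If the check times out or reports "No route to host" instead, it is iptables. Don't disable the firewall; configure it properly.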

    iptables -I INPUT -p tcp -m tcp --dport 5666 -s 192.168.1.50 -j ACCEPT
    service iptables save
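To confirm the rule works, test the ports from the Watchtower rather than from the client itself. A minimal sketch, assuming bash with /dev/tcp support and the coreutils timeout command; port_open is a hypothetical helper, and 4949 (the default munin-node port) needs a similar ACCEPT rule of its own:

```shell
#!/bin/bash
# Sketch: succeed if a TCP connection to host:port opens within 3 seconds.
# port_open is a hypothetical helper; it relies on bash's /dev/tcp
# pseudo-device and on timeout from coreutils.
port_open() {
    local host=$1 port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage: ./check_ports.sh <client-ip>   (hypothetical script name)
# 5666 is NRPE, 4949 is munin-node.
if [ -n "${1:-}" ]; then
    for port in 5666 4949; do
        if port_open "$1" "$port"; then
            echo "port $port reachable"
        else
            echo "port $port blocked or closed"
        fi
    done
fi
```

A reachable port only proves the firewall is out of the way; NRPE can still refuse the request if the Watchtower is missing from allowed_hosts.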

Final Thoughts

Complexity is the enemy of reliability. Start with these three checks: Ping, Disk Space, and HTTP Response. Once those are stable, add the latency script I provided above. Do not rely on your cloud provider's "status page"—it is always the last thing to update.
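The three starter checks could look like this on the Watchtower. This is a sketch. It assumes a host named web01 (hypothetical), the check_ping and check_http commands from the sample commands.cfg, and a check_nrpe command definition of your own that passes its argument to the client:

```cfg
define service{
        use                     generic-service
        host_name               web01
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        }

define service{
        use                     generic-service
        host_name               web01
        service_description     HTTP
        check_command           check_http
        }

define service{
        use                     generic-service
        host_name               web01
        service_description     Root Disk
        check_command           check_nrpe!check_root_disk
        }
```

The ping thresholds read as round-trip time in milliseconds and packet loss: warning at 100ms or 20% loss, critical at 500ms or 60% loss. Tighten them once you know what normal looks like on your network.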

If you need a sandbox to test this configuration without risking your production environment, spin up a small instance. Low latency connectivity to the NIX makes testing these configurations instant.

Ready to harden your infrastructure? Deploy a KVM instance on CoolVDS today and get full root access to build the monitoring stack your business deserves.