Sleep Through the Night: The Art of Bulletproof Server Monitoring with Nagios & Munin

If You Can't Measure It, It's Already Broken

It's 3:14 AM. The buzzing of your Nokia N97 on the nightstand isn't an alarm; it's a heartbeat failure notification. Your MySQL slave is desynchronized. Or maybe it isn't. Maybe it's just a backup script driving up I/O and causing a timeout. You won't know until you drag yourself out of bed, SSH in, and run top.

There is a better way. In the hosting game, specifically here in the Nordic market where reliability is practically a religion, reactive administration is a career-ending move. You need to be proactive. You need to know the disk is filling up three weeks before it hits 100%.

Today, we aren't just installing software; we are building a telemetry deck using two long-standing industry standards: Nagios for alerting and Munin for trending. And we're doing it on a CoolVDS Xen instance, because an OpenVZ container shares the host's kernel, and trying to monitor kernel metrics from inside a cheap one is a fool's errand.

The Philosophy: Alerting vs. Trending

Many sysadmins confuse the two. Let's clarify:

  • Nagios (The Watchdog): Binary status. Is it UP or DOWN? Is load > 5.0? It screams when things break.
  • Munin (The Historian): Graphing. When did the load hit 5.0? Was it gradual or sudden? It whispers why things broke.
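To make the "binary status" idea concrete, here is a minimal sketch of how a Nagios-style check reports its verdict. Nagios plugins signal state through their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function name check_load_value and the thresholds are our own illustration, not part of the stock plugin set; in production you would normally use the packaged check_load plugin instead.

```shell
# check_load_value <load> <warn> <crit>
# Prints a one-line status and returns the matching Nagios exit code.
# Hypothetical helper for illustration only.
check_load_value() {
    awk -v l="$1" -v w="$2" -v c="$3" 'BEGIN {
        # Force numeric comparison; awk would otherwise compare some inputs as strings.
        if (l + 0 >= c + 0) { printf "CRITICAL - load %s\n", l; exit 2 }
        if (l + 0 >= w + 0) { printf "WARNING - load %s\n", l; exit 1 }
        printf "OK - load %s\n", l
        exit 0
    }'
}

# Example: feed it the 1-minute load average from /proc/loadavg.
# check_load_value "$(cut -d" " -f1 /proc/loadavg)" 3.0 5.0
```

Nagios only cares about the exit code and the first line of output. Everything about history and trends is left to Munin.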

If you run a high-traffic site targeting Norwegian users, latency matters. You need to see if that spike in latency correlates with the backup job or a traffic surge from VG.no.

Step 1: The Watchdog (Nagios 3.2 on CentOS 5)

We prefer CentOS 5.4 for its stability. First, enable the EPEL repository if you haven't already, then pull the trigger.

[root@oslo-node-01 ~]# yum install nagios nagios-plugins-all nagios-plugins-nrpe
[root@oslo-node-01 ~]# chkconfig nagios on
[root@oslo-node-01 ~]# chkconfig httpd on

The magic isn't in the installation; it's in the configuration. The default nagios.cfg is too noisy. You want to avoid "flapping"—where a service toggles between OK and CRITICAL rapidly, flooding your inbox.
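If you are wondering how "flapping" is measured: Nagios looks at the last 21 check results, which leaves room for 20 state changes, and expresses the number of changes as a percentage. The sketch below computes that percentage from a list of states. It is a simplification, because the real calculation weights recent changes more heavily than old ones; the function name flap_percent is our own.

```shell
# flap_percent "<space-separated check states, oldest first>"
# Unweighted percent state change: transitions / possible transitions * 100.
# Simplified sketch: real Nagios weights newer state changes more heavily.
flap_percent() {
    echo "$1" | awk '{
        if (NF < 2) { print "0.0"; exit }
        changes = 0
        for (i = 2; i <= NF; i++)
            if ($i != $(i - 1)) changes++
        printf "%.1f\n", 100 * changes / (NF - 1)
    }'
}

# Example: 21 results (0 = OK, 2 = CRITICAL) containing 8 state changes.
# flap_percent "0 0 0 2 0 0 2 0 0 0 2 2 0 0 0 0 2 0 0 0 0"
```

A service is flagged as flapping once this figure rises above the high flap threshold, and the flag is cleared again once it drops below the low threshold.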

Tuning nagios.cfg for False Positives

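Open /etc/nagios/nagios.cfg and look for these directives. They damp alerts from flapping services and cap how long a single check may run.

enable_flap_detection=1
low_service_flap_threshold=20.0
high_service_flap_threshold=30.0
# Kill any service check that runs longer than 60 seconds.
# This is a per-check timeout, not a paging delay. How long a service must
# stay down before you are paged is set per service, through
# max_check_attempts and retry_interval in the service definition.
service_check_timeout=60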
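The "don't page me for five minutes" behaviour comes from the service definition rather than nagios.cfg: after the first failed check, Nagios rechecks every retry_interval and only notifies once max_check_attempts is reached. Here is a rough sketch of that arithmetic, assuming the default interval_length=60 so that one interval unit is one minute. The host name, service description, and check command in the generated definition are placeholders, and generic-service assumes the sample templates are installed.

```shell
# Minutes between the first failed check and the HARD state that triggers
# notifications (assumes interval_length=60, i.e. one unit = one minute).
minutes_until_page() {
    max_check_attempts=$1
    retry_interval=$2
    echo $(( (max_check_attempts - 1) * retry_interval ))
}

# Print a service definition using those two values.
# host_name, service_description and check_command are placeholders.
print_service() {
    cat <<EOF
define service {
    use                     generic-service
    host_name               oslo-node-01
    service_description     MySQL TCP
    check_command           check_tcp!3306
    max_check_attempts      $1
    retry_interval          $2
}
EOF
}

# Example: 6 attempts, 1 minute apart, gives a 5-minute grace period.
# minutes_until_page 6 1
# print_service 6 1
```

After editing object definitions, running nagios -v /etc/nagios/nagios.cfg checks the configuration before you restart the daemon.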

This is where infrastructure choice matters. On shared hosting or oversold VPS providers, "CPU Steal" (visible in top as %st) can trigger load alerts even when your VM is idle. This happens because the host node is overloaded. At CoolVDS, we use Xen paravirtualization with strict resource isolation. If Nagios says load is high on our servers, it's actually your traffic, not a noisy neighbor.
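If you would rather measure steal than eyeball it in top, the kernel exposes the raw counters on the first line of /proc/stat. The sketch below takes two samples of that line and reports what share of the elapsed CPU time was stolen. The function name steal_percent is ours, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# steal_percent "<cpu line at t0>" "<cpu line at t1>"
# Each argument is the aggregate "cpu ..." line from /proc/stat.
# Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
steal_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            steal = $9
        }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            if (total > t0) printf "%.1f\n", 100 * (steal - s0) / (total - t0)
            else print "0.0"
        }
    '
}

# Example: sample five seconds apart.
# a=$(grep "^cpu " /proc/stat); sleep 5; b=$(grep "^cpu " /proc/stat)
# steal_percent "$a" "$b"
```

A result that stays in the double digits while your own processes are idle points at the host node, not at your application.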

Step 2: The Historian (Munin)

Nagios tells you the server is on fire; Munin tells you who lit the match. Munin uses RRDTool to generate graphs. It is lightweight and Perl-based.

[root@oslo-node-01 ~]# yum install munin munin-node
[root@oslo-node-01 ~]# chkconfig munin-node on
[root@oslo-node-01 ~]# service munin-node start

The configuration file is at /etc/munin/munin-node.conf. You need to allow the master poller to connect. If you are running the master on the same host (common for standalone setups), the default allow ^127\.0\.0\.1$ is fine.
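If the master lives on another machine, munin-node.conf needs an additional allow line, written as a regular expression with the dots escaped. A small sketch that builds that line from a plain IPv4 address follows; the function name munin_allow_line and the address 10.0.0.5 are purely illustrative.

```shell
# munin_allow_line <ipv4-address>
# Prints an "allow" directive for munin-node.conf with the dots escaped,
# so the regex matches exactly that address and nothing else.
munin_allow_line() {
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf 'allow ^%s$\n' "$escaped"
}

# Example (10.0.0.5 is a made-up master address):
# munin_allow_line 10.0.0.5 >> /etc/munin/munin-node.conf
# service munin-node restart
```

Forgetting to escape the dots still works, but the pattern then matches more addresses than you intended, since an unescaped dot matches any character.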

The "MySQL Slow Query" Trap

A common issue we see with Magento installs in Norway is database locking. To graph its symptoms (slow queries and thread counts), symlink the MySQL plugins. The usual root cause is MyISAM, which locks the entire table on every write, whereas InnoDB locks individual rows. If you haven't migrated to InnoDB yet, do it now. The performance difference on our RAID-10 SAS arrays is night and day.

ln -s /usr/share/munin/plugins/mysql_slowqueries /etc/munin/plugins/mysql_slowqueries
ln -s /usr/share/munin/plugins/mysql_threads /etc/munin/plugins/mysql_threads
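If you enable more than a couple of plugins, a small helper saves typing and catches misspelled names. This is a sketch under the assumption that plugins live in /usr/share/munin/plugins and are activated from /etc/munin/plugins, as in the commands above; the function name enable_munin_plugins is our own.

```shell
# enable_munin_plugins <source-dir> <target-dir> <plugin>...
# Symlinks each named plugin into the target directory.
# Returns 1 if any requested plugin was missing or not executable.
enable_munin_plugins() {
    src=$1
    dst=$2
    shift 2
    status=0
    for name in "$@"; do
        if [ -x "$src/$name" ]; then
            ln -sf "$src/$name" "$dst/$name"
        else
            echo "skipping $name: not found in $src" >&2
            status=1
        fi
    done
    return $status
}

# Example:
# enable_munin_plugins /usr/share/munin/plugins /etc/munin/plugins \
#     mysql_slowqueries mysql_threads && service munin-node restart
# munin-run mysql_threads    # should print a value, not an error
```

munin-node only picks up new symlinks after a restart, and munin-run is the quickest way to confirm a plugin can actually reach MySQL before you wait for the next graph update.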

Privacy and The "Datatilsynet" Angle

Operating in Norway means respecting the Personal Data Act (Personopplysningsloven). When you configure Nagios logging, be mindful of what you store. If your web server error logs (which Nagios might parse) contain IP addresses, you are processing personal data.

Pro Tip: Configure log rotation (logrotate) to purge raw HTTP logs after 30 days unless you have a specific legal requirement to keep them. It reduces disk usage and liability.
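As a starting point, the sketch below writes a logrotate policy that rotates Apache logs daily and keeps 30 rotations, which works out to roughly 30 days of history. The path /var/log/httpd/*log and the reload command follow the usual CentOS layout, but treat them as assumptions and compare against the file your httpd package already ships. The function name write_httpd_logrotate is ours.

```shell
# write_httpd_logrotate <output-file>
# Writes a 30-day retention policy for Apache logs to the given file.
# Paths assume a stock CentOS httpd install; adjust to your layout.
write_httpd_logrotate() {
    cat > "$1" <<'EOF'
/var/log/httpd/*log {
    daily
    rotate 30
    missingok
    notifempty
    compress
    delaycompress
    sharedscripts
    postrotate
        /sbin/service httpd reload > /dev/null 2>/dev/null || true
    endscript
}
EOF
}

# Example: back up the packaged file first, then dry-run the new policy.
# cp /etc/logrotate.d/httpd /root/httpd.logrotate.bak
# write_httpd_logrotate /etc/logrotate.d/httpd
# logrotate -d /etc/logrotate.d/httpd
```

The -d flag runs logrotate in debug mode, so you can see what would be rotated or deleted without touching any files.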

Why Infrastructure is the Best Monitoring Tool

You can tune Nagios until you are blue in the face, but you cannot software-patch bad hardware. Latency jitter, packet loss at the switch level, or disk I/O contention will make your graphs look like a heart attack victim's ECG.

This is why serious sysadmins deploy on CoolVDS.

  • Connectivity: We peer directly at NIX (Norwegian Internet Exchange) in Oslo. Your latency to Norwegian users is measured in single-digit milliseconds.
  • I/O Performance: We don't use single SATA drives. We run enterprise 15k RPM SAS drives in RAID-10. Your Munin graphs for "Disk Latency" will be boringly flat—exactly how you want them.
  • Reliability: Redundant power and cooling ensure that the only downtime you record is the downtime you scheduled.

Don't let a slow server ruin your reputation or your sleep. Deploy a test instance on CoolVDS today, set up your sensors, and finally enjoy the sound of silence.