Escaping Nagios Hell: Infrastructure Monitoring at Scale for High-Traffic Nodes
It is 3:14 AM. Your pager buzzes. It’s the third time this week. The alert reads: CRITICAL: Load Average > 10.0. By the time you SSH into the box, the load has dropped back to 0.8. The site is up. The logs are clean. You go back to sleep, angry and tired. If this sounds familiar, your monitoring strategy is broken.
In the Norwegian hosting market, where we pride ourselves on stability and the robustness of the NIX (Norwegian Internet Exchange) infrastructure, having a flaky monitoring setup is inexcusable. Whether you are running a Magento cluster or a high-traffic media server, the standard "install Nagios and pray" approach stops working once you pass 50 nodes. The signal-to-noise ratio destroys your team's morale.
I have spent the last six months refactoring the infrastructure for a major Oslo-based e-commerce client. We moved from a monolithic Nagios setup to a distributed Zabbix architecture backed by Graphite for trending. Here is how we did it, and why the underlying hardware—specifically the virtualization technology—matters more than your check scripts.
The Enemy is Not Load, It is I/O Wait
Most default monitoring templates are useless. They alert on CPU usage or Load Average without context. In a virtualized environment, high load is often a symptom of I/O bottlenecks, not CPU exhaustion. If you are hosting on cheap OpenVZ containers, you are familiar with the "noisy neighbor" effect.
When a neighbor on the same physical host kicks off a backup or a heavy compile, the shared disk queue backs up and your I/O requests sit in line behind theirs. Your CPU waits for data. The load spikes. Your monitoring screams.
To detect this, stop looking at top. Start looking at %iowait and %steal.
Diagnosing the Bottleneck
Use iostat (part of the sysstat package on CentOS 6/Ubuntu 12.04) to see what is really happening.
# Install sysstat if you haven't already
yum install sysstat -y        # CentOS / RHEL
apt-get install -y sysstat    # Debian / Ubuntu
# Watch disk stats every 2 seconds
iostat -x 2
If you see an output like this, you have a problem:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.50    0.00    2.10   45.20   12.40   35.80

Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
vda        0.00    4.00  15.00  45.00  450.00  3200.00    60.83    12.50  150.20   8.50  98.40
Three numbers matter here:
- %iowait (45.20): The CPU is doing nothing but waiting for the disk. This is death for a database.
- %steal (12.40): The hypervisor is stealing cycles from you to give to another VM. This confirms you are on an oversold host.
- await (150.20): The average I/O request takes 150 ms to complete, queue time included. On SSDs this should be in the single-digit milliseconds.
Pro Tip: If your %steal is consistently above 5%, move providers. No amount of software tuning fixes a greedy host. This is why at CoolVDS, we strictly use KVM virtualization with dedicated resource allocation. We don't oversell CPU cycles, so your monitoring reflects your load, not your neighbor's.
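Whichever host you are on, a tiny cron job can keep an eye on steal between manual checks. Below is a minimal sketch: it assumes the six-column avg-cpu layout shown above (column positions can shift between sysstat versions) and an arbitrary 5% threshold.

#!/bin/bash
# steal-check.sh -- crude cron-able check for hypervisor CPU steal (sketch).
# Assumes the six-column avg-cpu layout shown above; adjust $5 if your
# sysstat version prints extra columns.
THRESHOLD=5

# Sample for 5 seconds; the second avg-cpu block is the live interval,
# and %steal is the 5th value on the line below the header
STEAL=$(iostat -c 5 2 | awk '/avg-cpu/ { getline; val=$5 } END { printf "%d", val }')
STEAL=${STEAL:-0}

if [ "$STEAL" -ge "$THRESHOLD" ]; then
    echo "WARNING: CPU steal at ${STEAL}% - the host is overcommitted"
    exit 1
fi
echo "OK: CPU steal at ${STEAL}%"
exit 0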
Moving from Boolean Checks to Trending
Nagios is binary: OK or CRITICAL. But infrastructure is analog. You need to know the rate of change. This is why we are seeing a massive shift towards Graphite for metrics collection in 2013.
Instead of just alerting when disk space is 90% full, we want to see the slope of the line. Is it filling up by 1% a day or 1% a minute?
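This is where Graphite shines: carbon accepts metrics over a dead-simple plaintext protocol, one line of "path value timestamp" per metric, on TCP port 2003 by default. Here is a minimal sketch for pushing root filesystem usage from cron; the Graphite hostname and the metric prefix are placeholders for your own environment.

#!/bin/bash
# push-disk-usage.sh -- minimal sketch: send root filesystem usage to carbon.
# GRAPHITE_HOST and the metric prefix are placeholders.
GRAPHITE_HOST="graphite.example.net"
GRAPHITE_PORT=2003
PREFIX="servers.$(hostname -s).disk.root"

# df -P keeps each filesystem on a single line; strip the '%' sign
USED_PCT=$(df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }')

# Carbon plaintext format: "<metric.path> <value> <unix timestamp>"
# (nc flags vary between netcat builds; -w 1 just keeps it from hanging)
echo "${PREFIX}.used_pct ${USED_PCT} $(date +%s)" | nc -w 1 "$GRAPHITE_HOST" "$GRAPHITE_PORT"

Run it from cron every minute and let Graphite plot the slope for you.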
Implementing a Custom Metric Check
Here is a bash script we use for custom checks. It follows the Nagios plugin convention (status text, performance data after the pipe, and an exit code of 0, 1 or 2), which makes it easy to feed Nagios directly or wire into Zabbix as a UserParameter. This specific script checks MySQL slave lag, a common pain point for redundant setups.
#!/bin/bash
# Check MySQL Slave Lag for Zabbix/Nagios
# Date: 2013-02-15
# Author: DevOps Team
USER="monitor"
PASS="SuperSecurePassword2013"   # better: keep credentials in a .my.cnf for the monitor user

# Get Seconds_Behind_Master from SHOW SLAVE STATUS
LAG=$(mysql -u"$USER" -p"$PASS" -e "SHOW SLAVE STATUS\G" 2>/dev/null \
      | awk '/Seconds_Behind_Master/ { print $2 }')

# Empty result: MySQL is unreachable or this server is not configured as a slave
if [ -z "$LAG" ]; then
    echo "UNKNOWN: Could not read slave status"
    exit 3
fi

# NULL means the IO or SQL thread is stopped
if [ "$LAG" = "NULL" ]; then
    echo "CRITICAL: Replication is broken!"
    exit 2
fi

# Alert thresholds, with performance data appended for graphing
if [ "$LAG" -ge 300 ]; then
    echo "CRITICAL: Slave is $LAG seconds behind master | lag=$LAG"
    exit 2
elif [ "$LAG" -ge 60 ]; then
    echo "WARNING: Slave is $LAG seconds behind master | lag=$LAG"
    exit 1
else
    echo "OK: Slave lagging by $LAG seconds | lag=$LAG"
    exit 0
fi
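Drop the script into /usr/local/bin (the path the agent configuration below expects), make it executable, and run it by hand before wiring it into anything:

chmod +x /usr/local/bin/check_mysql_lag.sh
/usr/local/bin/check_mysql_lag.sh
# On a healthy slave you should see something like:
# OK: Slave lagging by 0 seconds | lag=0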
The Zabbix Agent Configuration
To scale this, do not run SSH-based checks from your central server; they add latency and security risk. Run zabbix_agentd in active mode instead: the agent fetches its item list from the server and pushes results back, so the processing happens on the client node.
Edit /etc/zabbix/zabbix_agentd.conf on your client nodes:
# Basic optimization for Zabbix 2.0
Server=192.168.10.50
ServerActive=192.168.10.50
Hostname=web-node-01.oslo.coolvds.net
# Flush buffered values at least every 10 seconds, and keep up to 1000
# values in memory so short network blips don't drop data
BufferSend=10
BufferSize=1000
# Timeout for custom scripts
Timeout=10
# Keep the agent unprivileged; only set AllowRoot=1 if a check truly needs it
AllowRoot=0
# User parameters for custom keys
UserParameter=mysql.lag,/usr/local/bin/check_mysql_lag.sh
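Before building items and triggers in the frontend, verify the key from the shell. These are standard Zabbix 2.0 tools; only the hostname matches the example config above, and the init script name varies by package.

# Test the custom key locally (reads the config directly, no server needed)
zabbix_agentd -t mysql.lag

# Restart the agent so the running daemon picks up the new UserParameter
# (the service may be called zabbix-agent or zabbix_agentd depending on the package)
service zabbix-agent restart

# From the Zabbix server, confirm the agent answers over the network
zabbix_get -s web-node-01.oslo.coolvds.net -p 10050 -k mysql.lag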
Legal & Latency: The Norwegian Advantage
Technical metrics are not the only thing you need to monitor. You need to monitor your compliance perimeter. With the Data Protection Directive (95/46/EC) governing how we handle European data, hosting outside the EEA is becoming a legal minefield. Datatilsynet (The Norwegian Data Protection Authority) is increasing scrutiny on where personal data physically resides.
Furthermore, latency is simple physics. If your customers are in Oslo or Bergen, routing traffic through a datacenter in Frankfurt or Amsterdam adds 20-30ms of round-trip time (RTT). In high-frequency trading or real-time gaming, that is an eternity.
| Origin | Destination | Avg Latency (ms) |
|---|---|---|
| Oslo (Home/Office) | CoolVDS Oslo DC | 2-5 ms |
| Oslo | Frankfurt (AWS/Generic) | 25-35 ms |
| Oslo | US East (Virginia) | 90-110 ms |
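Numbers like these are easy to verify from your own office connection. mtr combines traceroute and ping into a single per-hop report; the target hostname below is a placeholder.

# 50 probes, summarized per hop
mtr --report --report-cycles 50 your-server.example.no

# Plain ping works too if mtr is not installed
ping -c 20 your-server.example.no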
The Hardware Foundation
You cannot monitor your way out of bad hardware. We recently audited a client complaining about random Zabbix timeout alerts. Their scripts were timing out because the disk subsystem on their previous budget host was choking on IOPS.
We migrated them to CoolVDS instances running on Enterprise SSDs in RAID 10. The result? The monitoring alerts stopped overnight. Not because we changed the thresholds, but because the checks actually completed within their timeout window.
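If you want to audit a disk subsystem yourself, a crude but revealing test is to force synchronous writes while watching iostat -x 2 in a second terminal. A minimal sketch; dsync deliberately bypasses the page cache, so do not run it on a busy production volume, and treat the numbers as indicative only.

# Write 1000 x 4K blocks, syncing each one to disk
dd if=/dev/zero of=/root/ddtest bs=4k count=1000 oflag=dsync
rm -f /root/ddtest

A healthy SSD-backed volume typically finishes this in a second or two; an oversold spinning-disk node will crawl, and you will see await and %util climb in the iostat window.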
We use KVM (Kernel-based Virtual Machine). Unlike OpenVZ, KVM provides true hardware virtualization. Your memory is your memory. Your kernel is your kernel. This isolation is critical for accurate monitoring. If you see a load spike on CoolVDS, it is because your application is busy, not because we oversold the node.
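Not sure what your current provider actually runs? You can check from inside the guest. virt-what (usually available from the distribution repositories) identifies the hypervisor, and OpenVZ containers give themselves away with /proc/vz.

# Identify the virtualization technology from inside the guest (run as root)
virt-what

# OpenVZ containers expose /proc/vz; KVM guests do not
[ -d /proc/vz ] && echo "This looks like an OpenVZ container"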
Final Thoughts
Effective monitoring in 2013 requires moving beyond simple "Is it up?" checks. You need granularity. You need to distinguish between application load and I/O wait. And most importantly, you need infrastructure that respects the metrics you are collecting.
Don't let false positives ruin another night's sleep. Ensure your baseline is solid. Deploy a test instance on CoolVDS today, run your own iostat benchmarks, and see the difference true KVM isolation makes.