Stop Letting Nagios Kill Your Sleep: Scaling Infrastructure Monitoring in 2013
It is 3:14 AM. The pager goes off. The SMS reads: CRITICAL: Load Average > 10.0 on db-slave-04. You stumble out of bed, SSH in, run top, and see... nothing. The load is 0.4. The spike is gone. You go back to sleep, only to be woken up again at 3:45 AM. Rinse and repeat.
If you manage infrastructure in Norway, you know that while our power grid is stable, the "shared resources" promised by many hosting providers are anything but. I have spent the last six months refactoring the monitoring stack for a high-traffic e-commerce platform targeting the Nordic market. We learned the hard way that the traditional Nagios polling model simply collapses when you hit scale.
Here is how we moved from reactive panic to proactive metrics using Graphite, Collectd, and honest KVM virtualization.
The Problem: The "Poller" Bottleneck
Most sysadmins start with Nagios. It is the industry standard. You define a service check, and the Nagios server connects to the client (via NRPE or SSH), executes a script, and gets a text string back: OK or CRITICAL.
This works for 10 servers. At 200 servers, checking 50 metrics each, every 60 seconds, you are executing 10,000 SSH connections or TCP handshakes a minute. Your monitoring server becomes a bottleneck. Worse, you have no granularity. You know the CPU was high, but you don't know the slope of the curve.
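For reference, a classic pull-style service definition looks something like this (the host, template, and command names here are illustrative, not our real config):

# Nagios asks the question every minute; NRPE on the client answers OK/WARNING/CRITICAL
define service {
    use                   generic-service
    host_name             db-slave-04
    service_description   Load Average
    check_command         check_nrpe!check_load
    normal_check_interval 1
}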
Pro Tip: If you are still using check_mk or standard NRPE for granular metrics, stop. Use them for status (is it alive?), not for performance data (how fast is it?).
The Solution: Push vs. Pull (Enter Graphite)
Instead of the server asking "How are you?", the clients should shout "Here are my stats!" asynchronously. In 2013, the most robust stack for this is Collectd (the agent) sending data to Graphite (the storage/renderer).
This architecture is non-blocking. If the metrics server goes down, your production database doesn't hang waiting for a socket timeout. It simply drops the metric packets and keeps serving queries.
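Graphite's plaintext listener (port 2003 by default, matching the config below) keeps the protocol embarrassingly simple: one line per data point, containing a metric path, a value, and a Unix timestamp. You can push a test value by hand with netcat; the metric path here is just an example:

# Push a single hand-rolled data point to Carbon's plaintext port
echo "servers.web01.load.shortterm 0.42 $(date +%s)" | nc 10.10.0.50 2003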
Configuring Collectd for High-Resolution Metrics
On a CentOS 6 box (standard for enterprise), installing Collectd is straightforward via EPEL. The magic happens in /etc/collectd.conf. We want to enable the write_graphite plugin to stream data to our central monitoring node.
LoadPlugin "interface"
LoadPlugin "load"
LoadPlugin "memory"
LoadPlugin "write_graphite"

<Plugin "write_graphite">
  <Node "graphite">
    Host "10.10.0.50"
    Port "2003"
    Protocol "tcp"
    LogSendErrors true
    Prefix "servers."
    Postfix ""
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
  </Node>
</Plugin>
This configuration pushes a fresh batch of metrics every 10 seconds (collectd's default Interval), which gets you near real-time graphs. But gathering metrics is useless if the underlying hardware is lying to you.
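On the Graphite side, make sure Carbon's retention matches that resolution, or Whisper will quietly downsample your 10-second data. A sketch of what we might put in storage-schemas.conf (the section name and retention periods are illustrative; tune them to your disk budget):

# 10-second points for a day, 1-minute points for a week, 10-minute points for a year
[collectd]
pattern = ^servers\.
retentions = 10s:1d,60s:7d,10m:1y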
The "Steal Time" Trap and the KVM Advantage
Here is where the choice of hosting provider becomes an architectural decision, not just a billing one. In virtualized environments, %st (steal time) in top is the percentage of time your virtual CPU waits for a real CPU while the hypervisor is servicing another tenant.
I recently diagnosed a Magento installation that was sluggish despite low reported load. The culprit? OpenVZ containers. The provider had oversold the physical cores. Our container reported low CPU usage, but the instructions were simply queued, waiting for the neighbor to finish their PHP processing.
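You do not need an agent to spot this from inside the guest; the last column of vmstat is steal time. A quick sketch (the 5% threshold is arbitrary):

# Sample CPU counters over 5 seconds; the final column of vmstat is %st
STEAL=$(vmstat 5 2 | tail -1 | awk '{print $NF}')
if [ "$STEAL" -gt 5 ]; then
    echo "WARNING - hypervisor is stealing ${STEAL}% of our CPU time"
fi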
To accurately monitor performance, you need hardware isolation. This is why we deploy strictly on KVM (Kernel-based Virtual Machine) instances, like those provided by CoolVDS. A KVM guest runs its own kernel against dedicated virtual hardware, so the numbers it reports are its own. If I run iostat on a CoolVDS instance, the numbers match the physical reality of the disk controller.
Validating I/O Performance
When your database slows down, the first thing to check isn't the slow query log; it is whether the disk is choking. Use iostat from the sysstat package.
$ iostat -x 1 10
Pay attention to the %util and await columns.
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
vda 0.00 4.00 0.00 46.00 0.00 400.00 8.70 0.04 0.91 0.91 4.20
If await (average time for I/O requests to be served) is consistently above 10ms, your disk is the bottleneck. On traditional spinning HDDs (even SAS 15k), this happens fast. We migrated our DB master to CoolVDS's SSD-backed storage, and await dropped to sub-1ms levels. In 2013, SSD storage is not a luxury for databases; it is a requirement.
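You can wire that threshold into a simple check so nobody is eyeballing iostat at 3 AM. A rough sketch, assuming the sysstat 9.x column layout shown above (await is column 10) and a virtio disk named vda:

# Take a 5-second extended sample for vda and keep the last await reading
AWAIT=$(iostat -dx vda 5 2 | awk '/^vda/ {await=$10} END {print await}')
# bc handles the float comparison against a 10ms threshold
if [ "$(echo "$AWAIT > 10" | bc)" -eq 1 ]; then
    echo "WARNING - await on vda is ${AWAIT}ms"
fi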
Network Latency: The NIX Factor
Monitoring isn't just about CPU and Disk; it's about the wire. For Norwegian users, latency to the Norwegian Internet Exchange (NIX) in Oslo is paramount. If your server is hosted in Germany or the US, you are adding 30-100ms of round-trip time (RTT) to every request.
We use a custom Nagios plugin to check latency specifically to local endpoints to ensure our peering remains optimal. Here is a simplified bash wrapper we use:
#!/bin/bash
# check_latency_nix.sh - Nagios check for average RTT to the NIX peering point
TARGET="193.156.90.1" # NIX peering point
WARN=10
CRIT=20

# Average RTT is the 5th '/'-separated field of ping's summary line
LATENCY=$(ping -c 4 "$TARGET" | tail -1 | awk -F '/' '{print $5}')

# No summary line means no replies at all - alert immediately
if [ -z "$LATENCY" ]; then
    echo "CRITICAL - No ping reply from ${TARGET}"
    exit 2
fi

# Bash does not handle floats well, so delegate the comparison to bc
IS_CRIT=$(echo "$LATENCY > $CRIT" | bc)
IS_WARN=$(echo "$LATENCY > $WARN" | bc)

if [ "$IS_CRIT" -eq 1 ]; then
    echo "CRITICAL - Latency to NIX is ${LATENCY}ms"
    exit 2
elif [ "$IS_WARN" -eq 1 ]; then
    echo "WARNING - Latency to NIX is ${LATENCY}ms"
    exit 1
else
    echo "OK - Latency to NIX is ${LATENCY}ms"
    exit 0
fi
Running this check from a server in Oslo usually returns <2ms. If you are seeing higher, your routing is inefficient.
Data Integrity and "The Law"
We cannot talk about infrastructure without touching on the Norwegian Personal Data Act (Personopplysningsloven). While the EU is debating new regulations, the current Directive 95/46/EC is clear: you are responsible for where your data lives. Monitoring logs often contain IP addresses, which can be considered PII.
Centralizing logs with rsyslog or syslog-ng is standard, but ensure transmission is encrypted. We wrap our log shipping in stunnel if the native encryption support isn't available on legacy clients.
# /etc/rsyslog.conf example for shipping to central server
$PreserveFQDN on
*.* @@monitoring.local:514
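The @@ prefix forwards over TCP (a single @ would be UDP), but it is still plaintext on the wire. On legacy clients without native TLS support, we point that line at a local stunnel endpoint instead. A minimal client-side sketch, assuming a CA certificate has already been distributed and the central server listens for TLS on 6514:

# /etc/stunnel/rsyslog-client.conf (sketch; adjust paths and ports to taste)
client = yes
CAfile = /etc/stunnel/ca-cert.pem
verify = 2

[syslog-tls]
accept  = 127.0.0.1:10514
connect = monitoring.local:6514

With that in place, rsyslog ships to @@127.0.0.1:10514 and stunnel handles the encryption to the central server.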
Conclusion
Scale exposes cracks in your architecture that you never see on a dev box. By moving from polling to pushing with Graphite, and upgrading from spinning rust to SSD-backed KVM instances, we reduced our false-positive pager alerts by 90%.
Infrastructure should be boring. It should be predictable. If you are tired of debugging "ghost" load spikes caused by noisy neighbors on oversold OpenVZ containers, it is time to upgrade.
Deploy a KVM instance on CoolVDS today and see what 0.5ms disk latency actually feels like.