Stop Trusting "Up": The Reality of Infrastructure Monitoring in 2013

It is 3:00 AM on a Tuesday. The wind is howling outside your apartment in Oslo, and your phone is vibrating off the nightstand. Nagios says the Magento database server is "WARNING - Load Average: 15.02". By the time you SSH in, the load has dropped to 0.5. The site is up. The logs are clean. You go back to sleep, only to be woken up again at 4:30 AM.

This is the reality for most sysadmins managing virtual infrastructure today. We rely on binary indicators—Up or Down, Green or Red—while the actual performance bottlenecks hide in the gray areas of virtualization overhead and I/O latency. If you are running serious workloads, a simple TCP check on port 80 is professional negligence.

The "Steal Time" Ghost

In the current VPS market, overselling is the standard business model. Providers pile hundreds of OpenVZ containers onto a single host node. When your neighbor decides to compile a kernel or run a backup script, your performance tanks, but your monitoring system might not tell you why.

If you are running on virtualized hardware, the most critical metric you are probably ignoring is %st (Steal Time). It is the percentage of time your virtual CPU was ready to run but had to wait while the hypervisor serviced other guests. If it climbs above 5%, you are paying for CPU cycles you aren't getting.

Here is what you need to look for in top:

top - 14:22:05 up 14 days,  3:12,  1 user,  load average: 2.15, 2.00, 1.85
Tasks:  82 total,   1 running,  81 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.5%us,  3.2%sy,  0.0%ni, 70.1%id, 10.2%wa,  0.0%hi,  0.1%si,  3.9%st
Mem:   4056248k total,  3821012k used,   235236k free,   102564k buffers
Swap:  2097144k total,      124k used,  2097020k free,  1524112k cached

See that 3.9%st? That is lost performance. On CoolVDS, we strictly use KVM (Kernel-based Virtual Machine) with rigid resource isolation. We don't oversell cores, so your Steal Time stays at 0.0%. If you see high steal time on your current host, no amount of Apache tuning will fix it. You need to migrate.
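
If you want your monitoring system to catch this instead of your gut, steal time is easy to alert on. The stock Nagios plugins do not ship a check_steal, so here is a minimal sketch of one; the thresholds and the reliance on vmstat's last column being st are my assumptions, so adjust before trusting it in production.

#!/bin/bash
# /usr/lib64/nagios/plugins/check_steal (sketch)
# Reads the st column from the second vmstat sample; the first sample
# is an average since boot, so we throw it away.
WARN=${1:-5}
CRIT=${2:-10}

ST=$(vmstat 1 2 | tail -1 | awk '{ print $NF }')

if [ "$ST" -ge "$CRIT" ]; then
    echo "CRITICAL - CPU steal time is ${ST}%"
    exit 2
elif [ "$ST" -ge "$WARN" ]; then
    echo "WARNING - CPU steal time is ${ST}%"
    exit 1
else
    echo "OK - CPU steal time is ${ST}%"
    exit 0
fi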

I/O Wait: The Silent Killer

SSD adoption is only starting to gain traction in 2013, and many of us are still stuck managing legacy SATA spinning rust. The bottleneck is rarely CPU; it is almost always disk I/O. When your database tries to write to the transaction log and the disk subsystem is saturated, your CPU sits idle in an "iowait" state (%wa in the top output above).

Standard load checks are useless for diagnosing this. You need iostat (part of the sysstat package on CentOS 6). Do not just run it once; watch the queue size.
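
If the command is missing on your box, the package is one yum transaction away:

yum install -y sysstat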

[root@db01 ~]# iostat -x 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.50    0.00    1.20   25.40    0.00   68.90

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
vda               0.00    15.00    2.00   45.00    32.00   850.00    18.77     5.20   85.50   9.50  85.20

Analysis: Look at avgqu-sz (Average Queue Size). A value greater than 1 means requests are queuing up. A value of 5.20 implies your disk subsystem is screaming for mercy. At CoolVDS, we are rolling out pure SSD arrays across our Oslo datacenter to eliminate this specific bottleneck, drastically reducing await times.
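
avgqu-sz tells you the disk is saturated, but not which process is responsible. Two commands I reach for next: pidstat ships with the same sysstat package and breaks I/O down per process, and /proc/meminfo shows how much dirty data is still waiting to reach the platters.

# Per-process disk I/O, one-second samples (kB read/written per second, per command)
pidstat -d 1

# Dirty pages and active writeback, refreshed every second
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'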

Structuring Your Nagios Config for Reality

Stop using the default localhost.cfg. It is noise. For a production Linux environment, you need to monitor services, not just ping. Below is a snippet of a proper NRPE configuration for checking MySQL integrity and disk space, which are the two things that will actually get you fired if they fail.

Inside /etc/nagios/nrpe.cfg on the client:

# Check Disk Space - Warn at 20% free, Critical at 10% free
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /

# Check Load - Normalized for 4 Cores
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 4.0,3.0,2.0 -c 8.0,6.0,4.0

# Check MySQL Replication Lag (Custom Script)
command[check_mysql_slave]=/usr/lib64/nagios/plugins/check_mysql_slave
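
That covers the client side. On the Nagios server, each command still needs a service definition that reaches it through check_nrpe. A sketch follows, assuming the stock check_nrpe command definition and a host object named db01 (both illustrative; match them to your own setup):

define service{
        use                     generic-service
        host_name               db01
        service_description     Disk Space /
        check_command           check_nrpe!check_disk
        }

define service{
        use                     generic-service
        host_name               db01
        service_description     MySQL Replication Lag
        check_command           check_nrpe!check_mysql_slave
        }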

And here is the bash logic for that custom MySQL slave checker. In 2013, if you aren't running Master-Slave replication, you don't have a backup strategy, you have a hope strategy.

#!/bin/bash
# /usr/lib64/nagios/plugins/check_mysql_slave
# Pulls Seconds_Behind_Master from SHOW SLAVE STATUS and maps it to
# Nagios exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

LAG=$(mysql -u nagios -p'SecurePass123' -e "SHOW SLAVE STATUS\G" 2>/dev/null | grep "Seconds_Behind_Master" | awk '{ print $2 }')

if [ -z "$LAG" ]; then
    # Empty output: the query failed or this server is not a slave at all.
    echo "UNKNOWN - Could not read slave status"
    exit 3
elif [ "$LAG" == "NULL" ]; then
    # NULL means replication is not actually running.
    echo "CRITICAL - Replication is stopped!"
    exit 2
elif [ "$LAG" -gt 60 ]; then
    echo "WARNING - Slave is $LAG seconds behind master"
    exit 1
else
    echo "OK - Slave is $LAG seconds behind master"
    exit 0
fi
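
One prerequisite the script glosses over: the nagios MySQL user needs the REPLICATION CLIENT privilege to run SHOW SLAVE STATUS, and it should have nothing more. The credentials below simply mirror the script; substitute your own.

# Run once on the slave, as a MySQL root user
mysql -u root -p -e "GRANT REPLICATION CLIENT ON *.* TO 'nagios'@'localhost' IDENTIFIED BY 'SecurePass123';"
mysql -u root -p -e "FLUSH PRIVILEGES;"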

The Zabbix Alternative

While Nagios is the industry standard, Zabbix 2.0 (released last year) is proving to be a robust alternative for those who want graphing built in without wrestling with Cacti or Munin separately. Zabbix agents are lightweight and can push data in active mode, which is a lifesaver when the agents sit behind restrictive firewalls and cannot accept inbound connections.

However, Zabbix demands a well-tuned backend database. If you deploy it on a cheap, low-IOPS VPS, the history_uint table will grow massive and kill your performance. This brings us back to infrastructure.
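
What "tuned" means depends on your hardware, but as a rough sketch for a dedicated Zabbix backend with around 4 GB of RAM (these are starting values of mine, not official Zabbix guidance):

# /etc/my.cnf (fragment)
[mysqld]
# Keep history reads in memory rather than on disk
innodb_buffer_pool_size        = 2G
# Separate tablespaces keep the huge history tables manageable
innodb_file_per_table          = 1
# Trade a second of durability for far fewer fsyncs under constant writes
innodb_flush_log_at_trx_commit = 2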

Pro Tip: If you are monitoring clients in Norway, keep your monitoring server in Norway. The latency between an AWS instance in US-East and a server in the NIX (Norwegian Internet Exchange) in Oslo can lead to false timeout alerts during peak traffic windows. Local monitoring for local infrastructure reduces false positives.

Data Sovereignty and Compliance

We are seeing increasing scrutiny from Datatilsynet regarding where personal data is stored. With the current discussions around Safe Harbor and the Personal Data Act (Personopplysningsloven), it is risky to pipe your server logs—which often contain IP addresses—to third-party SaaS monitoring tools hosted outside the EEA.

By hosting your own Nagios or Zabbix instance on a CoolVDS server in Oslo, you keep your monitoring data within Norwegian jurisdiction. You get low latency access to your monitored nodes and full legal compliance.

Final Thoughts

Monitoring is not about installing a package; it is about understanding the hardware limitations beneath your OS. You cannot monitor your way out of bad hardware. If your %st is high or your await is spiking, no configuration change will save you.

Stop fighting for CPU cycles. Deploy a KVM-based, SSD-accelerated instance on CoolVDS today and see what 0% Steal Time looks like on your graphs.