Stop Trusting Top: The Real Art of Infrastructure Monitoring at Scale
It is 3:00 AM. My pager is screaming. Again. Nagios says the load average on the database master is 15.0. I SSH in, sweat forming, expecting a crash. I run top. CPU usage is... 4%? Memory is fine. What is happening?
If you have managed servers for more than a week, you know this ghost. It’s called I/O Wait, and on standard hosting with spinning SAS drives, it is the silent killer of availability. In 2013, if you are still relying on basic ping checks and default load thresholds, you are flying blind.
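You can confirm that story in seconds before blaming the CPU. vmstat (from the procps package) shows the blocked-process and I/O-wait columns side by side; a rough spot-check looks like this:

$ vmstat 1 5
# Columns to watch:
#   b  - processes blocked waiting on I/O
#   wa - percentage of CPU time spent waiting for I/O to complete
# A high "wa" alongside a low "us" is exactly the 3 AM scenario above:
# the CPU is idle, the disks are drowning.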
We are building complex stacks now. Nginx in front of Apache, memcached layers, MySQL replication setups. The "is it up?" check is dead. The new question is: "Is it fast, and will it survive the next hour?"
The Lie of Shared Resources (and OpenVZ)
Before we touch configuration, we need to address the platform. Most budget VPS providers in Europe love OpenVZ. It’s cheap, it allows them to oversell RAM, and it’s a nightmare for accurate monitoring.
On a container-based system like OpenVZ, /proc/stat is often virtualized incorrectly. You might see low CPU usage, but your application is stalling. Why? Steal Time. Your "neighbors" on the host node are chewing up CPU cycles before the host kernel ever gets around to scheduling your process.
Pro Tip: Always check %st (steal time) in top. If it’s consistently above 5-10%, move providers immediately. This is why we insist on KVM virtualization at CoolVDS. Hardware virtualization gives your guest kernel honest CPU accounting, steal time included. No hiding.
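For scripts and cron jobs you don't need interactive top; a batch-mode sample, or sar from the same sysstat package you'll want later anyway, does the job. A quick sketch:

$ top -bn1 | grep -i 'cpu(s)'
# The last field, "st", is steal time for this sample.
$ sar -u 1 5
# The %steal column shows the same thing, averaged over each interval.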
The Toolchain: Nagios, Munin, and The Truth
For immediate alerting, Nagios is still the industry standard, despite the config file headaches. But Nagios only tells you point-in-time states: OK, WARNING, CRITICAL. It doesn't tell you trends. For that, you need Munin or Cacti. Seeing a graph of inode usage slowly creeping up over a month saves you from a catastrophic outage that no instant check will catch.
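While the graphs accumulate, inode usage is easy enough to spot-check from the shell. A one-liner along these lines (the 90% threshold is just an example) prints any filesystem creeping toward exhaustion:

$ df -iP | awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > 90) print $6, $5 "%" }'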
1. Monitoring Disk Latency, Not Just Usage
Most sysadmins check disk space. Few check disk speed. When a MySQL query goes to disk, milliseconds matter. Here is a custom script approach we use to alert if disk writes take longer than 50ms.
#!/bin/bash
# check_disk_latency.sh - NRPE check for average disk I/O latency (await)
# Usage: ./check_disk_latency.sh sda 50 100
DEVICE=$1
WARN=$2
CRIT=$3

# Pull the await column from iostat (requires the sysstat package).
# We take the second sample: the first one is the average since boot.
# Column 10 is "await" on current sysstat builds; verify with "iostat -x" if yours differs.
AWAIT=$(iostat -x -d "$DEVICE" 1 2 | awk -v dev="$DEVICE" '$1 == dev { val = $10 } END { print val }' | cut -d. -f1)

if [ -z "$AWAIT" ]; then
    echo "UNKNOWN - could not read await for $DEVICE from iostat"
    exit 3
elif [ "$AWAIT" -ge "$CRIT" ]; then
    echo "CRITICAL - Disk latency is ${AWAIT}ms | await=${AWAIT}ms"
    exit 2
elif [ "$AWAIT" -ge "$WARN" ]; then
    echo "WARNING - Disk latency is ${AWAIT}ms | await=${AWAIT}ms"
    exit 1
else
    echo "OK - Disk latency is ${AWAIT}ms | await=${AWAIT}ms"
    exit 0
fi
Hook this into NRPE. If you see this trigger often, your "Enterprise Storage" is likely just a shared SATA array choking on someone else's backup job.
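For reference, hooking the script in takes two small stanzas. The host name and plugin path below are placeholders; adjust them to your own layout:

# On the monitored host, in nrpe.cfg:
command[check_disk_latency]=/usr/lib/nagios/plugins/check_disk_latency.sh sda 50 100

# On the Nagios server:
define service {
    use                   generic-service
    host_name             db-master-01
    service_description   Disk Latency sda
    check_command         check_nrpe!check_disk_latency
}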
2. The Nginx Stub Status
Apache’s `mod_status` is heavy. Nginx is light. If you are serving high-traffic sites in Norway, you need to know exactly how many active connections you are handling before you can tune `worker_connections` (and `worker_processes`) sensibly.
Ensure your nginx.conf has this block (protected by IP!):
server {
    listen 127.0.0.1:80;
    server_name localhost;

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
Then, verify it with curl:
$ curl http://127.0.0.1/nginx_status
Active connections: 243
server accepts handled requests
11043 11043 34920
Reading: 0 Writing: 25 Waiting: 218
If "Waiting" is high, you are likely KeepAlive-bound. If "Reading" is high, your clients have slow latency—common if you are hosting in Germany but serving users in Oslo. Distance matters. Light travels fast, but routing tables are slow.
MySQL: The Buffer Pool Ratio
The single most important metric for MySQL performance is the InnoDB Buffer Pool Hit Rate. You want this as close to 100% as possible. If it drops to 90%, your database is reading from the physical disk 10% of the time. On a spinning HDD, that is death.
Add this to your monitoring scripts to calculate the ratio in real-time:
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';" | awk '
/Innodb_buffer_pool_read_requests/ { req = $2 }
/Innodb_buffer_pool_reads/ { disk = $2 }
END { if (req > 0) printf "Hit Rate: %.4f%%\n", (1 - disk/req) * 100 }'
If you can't keep this number above 99%, you have two options: Buy more RAM, or switch to SSD storage where a disk read doesn't take 10ms.
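If you take the RAM route, the knob is innodb_buffer_pool_size in my.cnf. The usual rule of thumb on a dedicated database host is roughly 70-80% of RAM; the numbers below assume an 8 GB box and are only a starting point, not gospel:

[mysqld]
# ~75% of RAM on a dedicated 8 GB database server
innodb_buffer_pool_size = 6G
# Avoid double-buffering through the OS page cache on dedicated hardware
innodb_flush_method     = O_DIRECT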
Latency and Data Sovereignty in Norway
We are seeing more businesses concerned about where their data physically lives. The Norwegian Data Inspectorate (Datatilsynet) is becoming stricter regarding the Personal Data Act. Hosting your data in a US cloud might seem easy, but latency to Oslo and legal ambiguity are real costs.
A ping from Oslo to a datacenter in Virginia is ~110ms. From Oslo to CoolVDS's Oslo zone? ~2ms. For a Magento store whose pages trigger 50 sequential backend requests, that latency compounds: roughly 5.5 seconds of pure network wait at 110ms per round trip, versus a tenth of a second at 2ms.
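You can demonstrate the compounding with nothing fancier than a shell loop. This sketch fires 50 sequential requests at a placeholder URL and reports the total wall-clock time; run it from your office and from a test VPS and compare:

#!/bin/bash
# Rough illustration of sequential round trips adding up.
URL="http://example.com/healthcheck"   # placeholder - point it at your own endpoint
START=$(date +%s.%N)
for i in $(seq 1 50); do
    curl -s -o /dev/null "$URL"
done
END=$(date +%s.%N)
echo "50 sequential requests: $(echo "$END - $START" | bc) seconds"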
| Metric | Standard SATA VPS | CoolVDS SSD VPS |
|---|---|---|
| Random IOPS | ~80 - 120 | ~5,000+ |
| Disk Latency | 5ms - 20ms | < 0.5ms |
| Backup Impact | High Load | Negligible |
The Hardware Solution
Software monitoring reveals the bottlenecks, but hardware fixes them. In 2013, the biggest leap you can make is abandoning spinning rust for Solid State Drives (SSD).
At CoolVDS, we have standardized on pure SSD RAID-10 arrays and KVM virtualization. We don't worry about I/O wait because the underlying hardware delivers throughput that traditional SATA arrays can only dream of. When you deploy with us, you aren't fighting for disk time with a noisy neighbor.
If your iostat is showing you high await times and your clients are complaining about slow page loads, stop tweaking your my.cnf and look at your infrastructure. Monitoring is only useful if you have the power to act on it.
Ready to kill I/O wait for good? Deploy a high-performance KVM instance in our Oslo datacenter today. It takes 55 seconds.