Stop Trusting Top: The Real Art of Infrastructure Monitoring at Scale

It is 3:00 AM. My pager is screaming. Again. Nagios says the load average on the database master is 15.0. I SSH in, sweat forming, expecting a crash. I run top. CPU usage is... 4%? Memory is fine. What is happening?

If you have managed servers for more than a week, you know this ghost. It’s called I/O Wait, and on standard hosting with spinning SAS drives, it is the silent killer of availability. In 2013, if you are still relying on basic ping checks and default load thresholds, you are flying blind.

We are building complex stacks now. Nginx in front of Apache, memcached layers, MySQL replication setups. The "is it up?" check is dead. The new question is: "Is it fast, and will it survive the next hour?"

The Lie of Shared Resources (and OpenVZ)

Before we touch configuration, we need to address the platform. Most budget VPS providers in Europe love OpenVZ. It’s cheap, it allows them to oversell RAM, and it’s a nightmare for accurate monitoring.

On a container-based system like OpenVZ, /proc/stat inside the container is often misleading. You might see low CPU usage while your application stalls. Why? Steal time. Your "neighbors" on the host node are chewing up CPU cycles before the host kernel ever schedules your process.

Pro Tip: Always check %st (steal time) in top. If it’s consistently above 5-10%, move providers immediately. This is why we insist on KVM virtualization at CoolVDS: the guest runs its own kernel under hardware virtualization, so the steal time you see is honest accounting. No hiding.
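
If you want steal time in your alerting rather than eyeballing top at 3 AM, a one-liner against vmstat does the job. A minimal sketch; the 10% default threshold is our own rule of thumb:

#!/bin/bash
# check_steal.sh - warn when CPU steal time gets ugly
THRESHOLD=${1:-10}

# Last column of the second vmstat sample is %st
# (the first sample is the average since boot, so we take the second)
STEAL=$(vmstat 1 2 | tail -1 | awk '{print $NF}')

if [ "$STEAL" -ge "$THRESHOLD" ]; then
    echo "WARNING - CPU steal is ${STEAL}% | steal=${STEAL}%";
    exit 1;
fi
echo "OK - CPU steal is ${STEAL}% | steal=${STEAL}%";
exit 0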

The Toolchain: Nagios, Munin, and The Truth

For immediate alerting, Nagios is still the industry standard, despite the config file headaches. But Nagios only tells you discrete states: OK, WARNING, or CRITICAL. It doesn't tell you trends. For that, you need Munin or Cacti. Seeing a graph of inode usage slowly creeping up over a month saves you from a catastrophic outage that no instant check will catch.
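
Munin makes trend collection almost trivial: a plugin is just an executable that prints a value, and munin-node graphs it forever. Here is a minimal sketch for that inode example, assuming the standard plugin protocol; the field name and thresholds are our own:

#!/bin/bash
# inode_usage - tiny Munin plugin: inode usage on the root filesystem
# Symlink into /etc/munin/plugins/ and restart munin-node
if [ "$1" = "config" ]; then
    echo "graph_title Inode usage on /"
    echo "graph_vlabel %"
    echo "graph_category disk"
    echo "inodes.label inode usage"
    echo "inodes.warning 85"
    echo "inodes.critical 95"
    exit 0
fi
echo "inodes.value $(df -i / | tail -1 | awk '{print $(NF-1)}' | tr -d '%')"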

1. Monitoring Disk Latency, Not Just Usage

Most sysadmins check disk space. Few check disk speed. When a MySQL query goes to disk, milliseconds matter. Here is a custom script approach we use to alert if average I/O wait (await) on a device climbs past 50ms.

#!/bin/bash
# check_disk_latency.sh
# Usage: ./check_disk_latency.sh sda 50 100

DEVICE=$1
WARN=$2
CRIT=$3

# Get await (ms) from the second iostat sample (requires the sysstat package)
# Column 10 is "await" on 2013-era sysstat; devices go before the interval/count
AWAIT=$(iostat -x -d $DEVICE 1 2 | grep "^$DEVICE" | tail -1 | awk '{print $10}' | cut -d. -f1)

if [ -z "$AWAIT" ]; then
    echo "UNKNOWN - could not read iostat for $DEVICE";
    exit 3;
fi

if [ "$AWAIT" -ge "$CRIT" ]; then
    echo "CRITICAL - Disk latency is ${AWAIT}ms | await=${AWAIT}ms";
    exit 2;
elif [ "$AWAIT" -ge "$WARN" ]; then
    echo "WARNING - Disk latency is ${AWAIT}ms | await=${AWAIT}ms";
    exit 1;
else
    echo "OK - Disk latency is ${AWAIT}ms | await=${AWAIT}ms";
    exit 0;
fi

Hook this into NRPE. If you see this trigger often, your "Enterprise Storage" is likely just a shared SATA array choking on someone else's backup job.
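
Wiring it into NRPE is two lines on the agent side and a service definition on the Nagios server. The paths and host name below are illustrative; adjust to your distribution:

# /etc/nagios/nrpe.cfg on the monitored host
command[check_disk_latency]=/usr/lib/nagios/plugins/check_disk_latency.sh sda 50 100

# Service definition on the Nagios server
define service {
    use                     generic-service
    host_name               db-master-01
    service_description     Disk Latency sda
    check_command           check_nrpe!check_disk_latency
}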

2. The Nginx Stub Status

Apache’s `mod_status` is heavy. Nginx is light. If you are serving high-traffic sites in Norway, you need to know exactly how many active connections you are handling to tune `worker_processes` and `worker_connections` sensibly.

Ensure your nginx.conf has this block (protected by IP!):

server {
    listen 127.0.0.1:80;
    server_name localhost;

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

Then, verify it with curl:

$ curl http://127.0.0.1/nginx_status
Active connections: 243 
server accepts handled requests
 11043 11043 34920 
Reading: 0 Writing: 25 Waiting: 218

If "Waiting" is high, you are likely KeepAlive-bound. If "Reading" is high, your clients have slow latency—common if you are hosting in Germany but serving users in Oslo. Distance matters. Light travels fast, but routing tables are slow.

MySQL: The Buffer Pool Ratio

The single most important metric for MySQL performance is the InnoDB Buffer Pool Hit Rate. You want this as close to 100% as possible. If it drops to 90%, your database is reading from the physical disk 10% of the time. On a spinning HDD, that is death.

Add this to your monitoring scripts to calculate the ratio in real-time:

mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';" | awk '
/Innodb_buffer_pool_read_requests/ { req = $2 }
/Innodb_buffer_pool_reads/ { disk = $2 }
END { printf "Hit Rate: %.4f%%\n", (1 - disk/req) * 100 }'

If you can't keep this number above 99%, you have two options: Buy more RAM, or switch to SSD storage where a disk read doesn't take 10ms.
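
The RAM option ultimately comes down to one line in my.cnf. On a dedicated database box with 8 GB of RAM, something like the following is the usual starting point; the sizes are illustrative, not a prescription:

# /etc/my.cnf (or /etc/mysql/my.cnf on Debian)
[mysqld]
# Rule of thumb on a dedicated DB host: 70-80% of physical RAM
innodb_buffer_pool_size = 6G
# On MySQL 5.5, remove the old ib_logfile* after changing this (clean shutdown first)
innodb_log_file_size    = 256M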

Latency and Data Sovereignty in Norway

We are seeing more businesses concerned about where their data physically lives. The Norwegian Data Inspectorate (Datatilsynet) is becoming stricter regarding the Personal Data Act. Hosting your data in a US cloud might seem easy, but latency to Oslo and legal ambiguity are real costs.

A ping from Oslo to a datacenter in Virginia is ~110ms. From Oslo to CoolVDS's Oslo zone? ~2ms. For a Magento store executing 50 sequential PHP requests, that latency compounds: 50 round trips at 110ms is over five seconds of pure network wait, versus about a tenth of a second at 2ms.

Metric          | Standard SATA VPS | CoolVDS SSD VPS
Random IOPS     | ~80 - 120         | ~5,000+
Disk Latency    | 5ms - 20ms        | < 0.5ms
Backup Impact   | High Load         | Negligible

The Hardware Solution

Software monitoring reveals the bottlenecks, but hardware fixes them. In 2013, the biggest leap you can make is abandoning spinning rust for Solid State Drives (SSD).

At CoolVDS, we have standardized on pure SSD RAID-10 arrays and KVM virtualization. We don't worry about I/O wait because the underlying hardware delivers throughput that traditional SATA arrays can only dream of. When you deploy with us, you aren't fighting for disk time with a noisy neighbor.

If your iostat is showing you high await times and your clients are complaining about slow page loads, stop tweaking your my.cnf and look at your infrastructure. Monitoring is only useful if you have the power to act on it.

Ready to kill I/O wait for good? Deploy a high-performance KVM instance in our Oslo datacenter today. It takes 55 seconds.