Surviving the Spike: A Paranoid Guide to Infrastructure Monitoring
I haven't slept properly since the "Great MySQL Crash" of 2014. If you've been in the trenches long enough, you know the feeling. The phone buzzes at 3:14 AM. Nagios is screaming. Your primary database is locked up, and you have no idea why because your metrics have a 5-minute resolution gap. By the time you SSH in, the load average is 50.0 and the server is effectively a brick.
In the Norwegian hosting market, where customers expect rock-solid stability and the Datatilsynet (Data Protection Authority) has been watching data sovereignty like a hawk since last year's Safe Harbor invalidation, guessing is not a strategy. It is negligence.
Most VPS providers sell you "vCores" and "RAM" but hide the metrics that actually matter: CPU Steal and Disk Latency. Today, we are going to look at the raw signals that tell you if your infrastructure is healthy or just dying slowly. We will use tools that actually exist today—Zabbix 3.0, good old iostat, and common sense.
The Silent Killer: CPU Steal (%st)
You run top. You see your user CPU is at 20%. You think you are fine. But your application is sluggish. Why? In a virtualized environment, you are sharing physical silicon with other tenants. If your neighbor decides to mine Bitcoin or compile the Linux kernel, a cheap host's hypervisor might pause your CPU cycles to serve them.
This shows up as %st (steal time). If this number is consistently above 0.5%, you are being robbed of the performance you paid for.
Here is what you need to look for in your terminal:
Cpu(s): 12.5%us, 3.2%sy, 0.0%ni, 82.1%id, 0.1%wa, 0.0%hi, 0.1%si, 2.0%st
See that 2.0%st? That means for 2% of the time, your virtual machine wanted to run, but the hypervisor said "wait." In a high-frequency trading app or a busy Magento store, that is an eternity.
Pro Tip: To monitor this programmatically, don't parse top. Use vmstat.
vmstat 1 5 | tail -n +4 | awk '{print $17}' # steal time column (headers and since-boot sample skipped)
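If you would rather have cron watch that number than eyeball it, here is a minimal sketch. The 0.5% threshold mirrors the figure above, and the logger call is just a stand-in for whatever actually pages you:
#!/bin/sh
# Average the steal column (%st) over four 1-second samples;
# tail skips the two header lines and the since-boot sample.
STEAL=$(vmstat 1 5 | tail -n +4 | awk '{ sum += $17; n++ } END { printf "%.2f", sum / n }')

# Alert when the average creeps over 0.5%
if awk -v s="$STEAL" 'BEGIN { exit (s + 0 > 0.5) ? 0 : 1 }'; then
    logger -t steal-watch "CPU steal averaging ${STEAL}% - noisy neighbor suspected"
fi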
This is why we architect CoolVDS on KVM with strict resource isolation constraints. We don't over-provision CPU cores to the point of suffocation. When you buy a slice of a Xeon, you get those cycles. No noisy neighbors stealing your lunch.
Disk Latency: The Bottleneck Everyone Ignores
With the rise of SSDs, we got lazy. We assume I/O is infinite. It isn't. Especially if you are on a shared storage backend (SAN) that is congested. High IOPS (Input/Output Operations Per Second) are nice, but latency is king. I don't care if I can do 10,000 IOPS if each one takes 50ms; by Little's law, that works out to roughly 500 requests stuck in flight at any given moment.
To diagnose a sluggish database, don't start with the slow query log. Look at the disk queue first.
The Diagnostic Command
Run this command on your database server during peak load:
iostat -x 1
You will get a deluge of data. Ignore most of it. Focus on await and %util.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.40 0.00 82.00 0.00 1450.50 35.38 0.65 7.90 0.00 7.90 0.45 3.70
vdb 0.00 0.00 345.00 120.00 8450.00 5400.00 60.00 4.50 12.50 10.00 15.00 2.10 98.50
In the example above, vdb (our data drive) has an await of 12.50ms. That is acceptable for spinning rust, but for an SSD, it's borderline. However, look at %util: 98.50%. The disk is saturated. Requests are queuing up.
If you see await spiking over 20ms on an SSD-based VPS, your provider's storage backend is thrashing. This is frequently why we push NVMe storage at CoolVDS. The queue depth on NVMe is vastly superior to SATA SSDs, allowing parallel request handling without the latency penalty.
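To make that a check a script can watch instead of a pair of tired eyes, a quick awk filter over the same output does the job. Field 10 is await in the sysstat layout shown above; verify the column on your build, since positions shift between releases. The 20ms threshold is the same one from the previous paragraph:
# Print any device whose await exceeds 20 ms during a 5-sample run
iostat -dxk 1 5 | awk '/^vd|^sd|^nvme/ && $10+0 > 20 { print $1, "await:", $10 " ms" }'
Feed that into your alerting of choice and you will hear about storage congestion before your customers do.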
Configuring Zabbix 3.0 for Real-Time Alerts
Zabbix 3.0 was released recently (February 2016), and it brings a much-needed UI refresh and encryption support. If you are still running Nagios Core with a mess of Perl scripts, it is time to move on.
We need to configure the Zabbix agent for active checks. Passive checks (where the server polls the agent) are fine for small setups, but if you are behind a complex firewall or NAT, active checks (where the agent initiates the connection and pushes data to the server) are more reliable.
Step 1: The Agent Config
Edit /etc/zabbix/zabbix_agentd.conf on your client node:
# /etc/zabbix/zabbix_agentd.conf
# The Zabbix Server IP
Server=192.168.10.5
ServerActive=192.168.10.5
# Unique Hostname (Must match Zabbix Web Interface)
Hostname=db-node-01.oslo.coolvds.net
# Encrypt the traffic (New in 3.0!)
TLSConnect=psk
TLSAccept=psk
TLSPSKIdentity=PSK 001
TLSPSKFile=/etc/zabbix/zabbix_agentd.psk
# Custom checks for MySQL monitoring. $MY_PASS must be available in the
# agent's environment, or use a .my.cnf for the zabbix user instead.
UserParameter=mysql.ping,mysqladmin -uroot -p$MY_PASS ping | grep -c alive
UserParameter=mysql.threads,mysqladmin -uroot -p$MY_PASS status | cut -f3 -d":" | cut -f1 -d"Q"
Notice the TLS settings. In a post-Snowden world, sending infrastructure metrics in plain text over the public internet is reckless. Even internal traffic should be encrypted.
Step 2: Generate the PSK
openssl rand -hex 32 > /etc/zabbix/zabbix_agentd.psk
chown zabbix:zabbix /etc/zabbix/zabbix_agentd.psk
chmod 640 /etc/zabbix/zabbix_agentd.psk
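Restart the agent and sanity-check the whole chain from the Zabbix server before waiting on the poller. This assumes zabbix_get is installed on the server and that the PSK file you point it at contains the same hex string you just generated:
# On the agent node: pick up the new config
systemctl restart zabbix-agent

# On the Zabbix server: query the custom key over the encrypted channel
zabbix_get -s db-node-01.oslo.coolvds.net -k mysql.ping \
    --tls-connect psk --tls-psk-identity "PSK 001" \
    --tls-psk-file /etc/zabbix/zabbix_agentd.psk
A reply of 1 means the agent answered and mysqladmin can see the database. From there, alerting in the frontend is a one-line trigger along the lines of {db-node-01.oslo.coolvds.net:mysql.ping.last()}=0.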
Automating the "Fix" with Ansible
Monitoring is useless if it only wakes you up to push a button. Automation should handle the first line of defense. Ansible 2.0 (released January 2016) introduced block/rescue error handling, which makes this kind of self-healing pattern much cleaner to express.
Here is a playbook pattern I use. It checks disk space. If it is critical, it cleans up package caches and old logs automatically. Only if that fails does it page me.
---
- hosts: webservers
  tasks:
    - block:
        - name: Check disk usage
          shell: df -h / | tail -1 | awk '{print $5}' | sed 's/%//'
          register: disk_usage
          failed_when: disk_usage.stdout | int > 90
          changed_when: false
      rescue:
        - name: Emergency cleanup - Apt clean
          apt:
            autoclean: yes
        - name: Emergency cleanup - Remove old logs
          shell: find /var/log -name "*.gz" -type f -mtime +30 -delete
        - name: Verify disk usage dropped
          shell: df -h / | tail -1 | awk '{print $5}' | sed 's/%//'
          register: new_disk_usage
          failed_when: new_disk_usage.stdout | int > 90
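A guard like this only earns its keep if it runs without me, so it lives in cron on the control node. The paths, the playbook name, and the ansible user are placeholders from my own layout, so adjust to yours:
# /etc/cron.d/disk-guard: run the cleanup playbook every 15 minutes
*/15 * * * * ansible ansible-playbook -i /etc/ansible/hosts /opt/playbooks/disk_guard.yml >> /var/log/disk_guard.log 2>&1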
The Norwegian Context: Latency and Jurisdiction
Since the Safe Harbor agreement collapsed in October 2015, storing customer data on US-controlled clouds (AWS, Azure) has entered a legal grey area. Many of my clients are migrating back to European soil to satisfy internal compliance teams.
But beyond compliance, there is physics. Light in fiber covers roughly 200 km per millisecond, so if your users are in Oslo, serving them from Frankfurt (about 1,100 km away in a straight line) costs at least 11ms of round trip before a single router has added its share; in practice it is 20-30ms. Serving them from Virginia adds 90ms or more.
Let's look at a trace from a local ISP in Bergen to a CoolVDS instance in Oslo:
mtr --report --report-cycles=10 185.x.x.x
Result: average latency of 4.2ms with no packet loss. At that latency, TCP handshakes snap shut and SSL negotiation feels instant. In e-commerce, milliseconds equal revenue.
Final Thoughts
You cannot manage what you do not measure. If you are relying on the "green light" in your hosting control panel, you are flying blind. Implement iostat checks, set up Zabbix active agents with encryption, and keep an eye on CPU steal time so it stays near zero.
If you are tired of fighting for resources on oversold platforms, it might be time to test your stack on infrastructure that respects your need for raw performance. Don't let slow I/O kill your SEO. Deploy a test NVMe instance on CoolVDS in 55 seconds.