Infrastructure Monitoring at Scale: Why Your "Uptime" Metric is Lying to You
It was 3:15 AM on a Tuesday when my pager went off. The alert said "Server Load High." I logged in via SSH. The site was up. Nginx was responding. But the Checkout button on a client's high-traffic Magento store was taking 12 seconds to process a request. Technically, we had 100% uptime. Practically, we were losing thousands of kroner per minute.
The culprit? CPU Steal Time. Our previous budget provider had oversold the physical host so aggressively that our VM was waiting in line just to execute basic instructions.
Most VPS providers in the crowded European market lie to you. They sell you vCPUs that don't exist and RAM that is ballooned out to swap. If you are serious about infrastructure, you stop looking at "Up/Down" and start looking at saturation, latency, and traffic. Here is how we build a monitoring stack that actually tells the truth, using tools available right now in 2019.
The Silent Killer: iowait and Steal Time
Before installing any fancy dashboards, you need to know how to spot a bad host from the command line. If you deploy on a CoolVDS instance, you likely won't see these numbers move because we use KVM with strict resource guarantees, but on budget clouds, this is your reality check.
Run vmstat 1 and watch the columns.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 102400  45000 560000    0    0    10     2   50   60  5  2 90  1  2
 4  1      0 102100  45000 560200    0    0  5000     0  120  200 10  5 50 30  5
Focus on the last two columns:
- wa (Wait I/O): The CPU is idle ONLY because it's waiting for the disk. If this is high, your storage is too slow (HDD or cheap SATA SSD). This kills database performance.
- st (Steal Time): The hypervisor is stealing cycles from your VM to serve another customer. If this is consistently above 0, move your data immediately.
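If you want a number instead of eyeballing vmstat, sysstat's mpstat can average both columns over a short window. A quick sketch, assuming sysstat is installed; column positions can shift between sysstat versions, so check them against your own header line before trusting the awk fields.
# Average iowait and steal over 10 one-second samples
mpstat 1 10 | awk '/Average/ {print "iowait: " $6 "%  steal: " $9 "%"}'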
Pro Tip: On CoolVDS NVMe instances, we typically see wa at 0 and st at 0.0. Why? Because we don't oversell our cores, and NVMe throughput (reading at 3,000 MB/s) prevents the CPU from waiting.
The Stack: Prometheus & Grafana (2019 Standard)
Forget Nagios. Hand-maintaining piles of object-definition config files in 2019 is a waste of billable hours. The industry standard right now is Prometheus for time-series data and Grafana for visualization. This setup pulls metrics rather than waiting for an agent to push them, which is cleaner for firewall management.
1. Deploying the Exporters
First, you need the node_exporter on every target server. This exposes kernel-level metrics. Don't run this as root if you can avoid it.
# Create a user for the exporter
useradd --no-create-home --shell /bin/false node_exporter
# Download version 0.17.0 (Current stable as of early 2019)
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar xvf node_exporter-0.17.0.linux-amd64.tar.gz
cp node_exporter-0.17.0.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Systemd service file /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
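With the binary and unit file in place, wire it into systemd and verify the endpoint. A minimal sketch; the iptables lines assume your Prometheus server sits at 10.0.0.4, which is a placeholder you should swap for your own monitoring IP.
# Reload systemd, then start and enable the exporter
systemctl daemon-reload
systemctl enable --now node_exporter

# Sanity check: metrics should be exposed on port 9100
curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head -n 3

# Prometheus pulls, so only the monitoring host needs to reach port 9100
iptables -A INPUT -p tcp --dport 9100 -s 10.0.0.4 -j ACCEPT
iptables -A INPUT -p tcp --dport 9100 -j DROP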
2. Configuring Prometheus
On your monitoring server (preferably a separate CoolVDS instance to ensure monitoring survives a cluster failure), configure prometheus.yml. We want a scrape interval of 15 seconds. Anything less is noise; anything more misses micro-bursts.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']

  - job_name: 'mysql-primary'
    static_configs:
      - targets: ['10.0.0.7:9104']
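Before reloading Prometheus, validate the file with promtool, which ships alongside the Prometheus 2.x binaries. The path below assumes you keep the config in /etc/prometheus/; adjust it to your layout.
# Catch YAML or scrape-config mistakes before they break scraping
promtool check config /etc/prometheus/prometheus.yml

# Once targets are up, this PromQL in the expression browser shows per-instance
# steal time as a percentage over the last five minutes:
#   avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100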
Compliance and the "NIX" Factor
We operate in Norway. This adds two layers of complexity: Latency and Legality.
- GDPR & Datatilsynet: If you are monitoring logs that contain IP addresses or User IDs, that is PII (Personally Identifiable Information). Storing these logs on a US-based cloud server technically violates GDPR principles regarding data sovereignty unless you have air-tight processing agreements. Keeping your monitoring stack on a VPS in Norway (like our Oslo datacenter) simplifies this. You stay within the jurisdiction of Norwegian law.
- Latency to NIX: The Norwegian Internet Exchange (NIX) is the heart of connectivity in Oslo. If your monitoring server is in Frankfurt but your customers are in Bergen, your latency alerts will be skewed by network hops. Local peering matters.
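To put numbers on that last point, compare round-trip times from the monitoring host to where your users actually are. A rough sketch; the hostnames are placeholders for your own probe targets.
# 20-packet RTT summary to an Oslo target vs. a Frankfurt target (placeholder hosts)
ping -c 20 -q probe-oslo.example.net
ping -c 20 -q probe-fra.example.net

# mtr adds per-hop loss and latency, handy for spotting congested peering
mtr --report --report-cycles 20 probe-oslo.example.net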
Alerting: Signal vs. Noise
The biggest mistake I see junior sysadmins make is alerting on CPU usage. Do not alert if CPU > 90%.
Why? If a background compression job runs for 10 minutes, CPU will be 100%, but the server is fine. Alert on symptoms, not causes. Alert if the website response time > 2 seconds. Alert if error rates > 1%.
Here is a practical alert.rules.yml for Prometheus:
groups:
  - name: host_alerting
    rules:
      # Alert if an instance has been down for 1 minute
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"

      # Alert on high average read latency (> 100 ms per completed read)
      - alert: SlowDisk
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Disk latency high on {{ $labels.instance }}"
The Hardware Reality Check
You can tune sysctl.conf and optimize Nginx buffers all day, but software cannot fix bad hardware physics. In 2019, spinning HDDs are obsolete for root filesystems.
| Storage Type | Avg IOPS (4K Random) | Typical Latency | Verdict |
|---|---|---|---|
| 7.2k SATA HDD | ~80-100 | 10-15 ms | Backup Only |
| Standard SSD | ~5,000-10,000 | 0.5-2 ms | Acceptable |
| CoolVDS NVMe | ~20,000+ | < 0.1 ms | Production Standard |
When you are running a database cluster, that latency difference between 2 ms and 0.1 ms aggregates. With 100 queries per page load, that is roughly 200 ms of accumulated disk wait versus 10 ms: the difference between a "snappy" feel and a sluggish one.
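Don't take a spec sheet's word for the table above; fio can reproduce the 4K random numbers on your own instance. A rough sketch, assuming fio is installed; it writes a 1 GB test file in the current directory, so run it somewhere disposable.
# 60-second 4K random-read test; compare IOPS and clat percentiles to the table
fio --name=randread-test --ioengine=libaio --direct=1 --rw=randread \
    --bs=4k --size=1G --iodepth=32 --runtime=60 --time_based \
    --group_reporting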
Conclusion
Stop trusting the "Green Checkmark" on your provider's status page. Implement your own metrics. Watch for steal time. Keep your data within Norwegian borders to keep Datatilsynet happy.
If you are tired of debugging latency that turns out to be your host's fault, it is time to switch infrastructure.
Don't let slow I/O kill your SEO or your sleep. Deploy a high-performance NVMe instance on CoolVDS in 55 seconds and see what 0.0% steal time feels like.