Architecture War Stories: Scaling Infrastructure Monitoring Without The Noise
It is 3:00 AM. The pager screams. Your Magento cluster in Oslo is timing out. You check your cloud provider's default monitoring dashboard, and it shows CPU usage at a comfortable 40%. You go back to sleep. Ten minutes later, the site is down hard. Why? Because your "monitoring" is showing you a smoothed-out average from ten minutes ago, while your actual CPU steal time is spiking to 90% every few seconds due to noisy neighbors.
I have seen this scenario play out in dozens of post-mortems. In 2019, if you are still relying on passive, polled SNMP checks or the default graphs provided by budget VPS hosts, you are flying blind. Infrastructure monitoring at scale isn't about pretty charts; it is about granularity, retention, and knowing exactly when your disk I/O is acting up before the customers complain.
This is not a guide on how to install Nagios. This is a look at how we handle observability for high-throughput environments, using the Prometheus stack, and why underlying hardware architecture—specifically KVM and NVMe—dictates the reliability of your metrics.
The Lie of "Averages" and The CPU Steal Trap
Most legacy monitoring tools poll servers every 5 minutes. In the world of high-frequency trading or high-traffic e-commerce, 5 minutes is an eternity. A micro-burst of traffic lasting 30 seconds can pile up database lock waits until connections are exhausted, yet appear as a minor blip on a 5-minute average graph.
To catch these ghosts, you need high-resolution scraping. We recommend a 15-second scrape interval for critical services.
However, the metric most sysadmins ignore is %st (Steal Time). This measures the time your virtual CPU waits for the physical hypervisor to give it attention. On oversold OpenVZ or budget cloud platforms, this number kills performance.
Here is how you check it instantly in the terminal:
# Print the steal-time column ($16 on the %Cpu(s) line in recent versions of top)
top -b -n 1 | grep "Cpu(s)" | awk '{print $16 " steal"}'
If that number is consistently above 1-2%, your host is overselling resources. This is why we enforce strict KVM isolation at CoolVDS; we don't allow neighbors to eat your cycles. If you see high steal time, no amount of software tuning will fix it. You need to migrate.
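Once the Prometheus stack from the next section is in place, you can alert on steal directly instead of eyeballing top. A minimal rule sketch, assuming node_exporter 0.16+ metric names (the alert name and thresholds are illustrative):

groups:
  - name: cpu-steal
    rules:
      - alert: HighCpuSteal
        # Average steal fraction across all vCPUs of an instance, as a percentage
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU steal above 2% on {{ $labels.instance }} for 10 minutes"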
Building the Stack: Prometheus + Node Exporter
In 2019, Prometheus is the undisputed standard for time-series data. Unlike push-based systems (like Graphite), Prometheus pulls data. This ensures your monitoring system doesn't get flooded if a rogue service goes haywire.
Here is a production-ready prometheus.yml configuration block optimized for a mid-sized fleet. Note the specific scrape intervals:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: instance
        replacement: '${1}'
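Before restarting Prometheus with a new file, validate it. A quick sanity check, assuming the standard /etc/prometheus layout and a process named prometheus:

# Validate syntax and semantics of the configuration
promtool check config /etc/prometheus/prometheus.yml
# Prometheus reloads its configuration on SIGHUP, no restart needed
kill -HUP $(pidof prometheus)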
You need node_exporter running on every target machine. Don't use the default settings blindly. Enable the systemd collector to track service restarts.
Create a systemd service file at /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --no-collector.wifi

[Install]
WantedBy=multi-user.target
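With the unit in place, create the service account and start the exporter. A short sketch, assuming the binary already sits at /usr/local/bin/node_exporter as referenced above:

# Dedicated unprivileged account matching the User/Group in the unit file
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Verify metrics are exposed on the default port
curl -s http://127.0.0.1:9100/metrics | grep node_cpu | head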
Disk I/O: The Bottleneck You Can't Cache Away
You can throw RAM at a database, but eventually, you have to write to disk. In a Norway-based infrastructure handling GDPR-sensitive logs, write latency is critical. Spinning rust (HDD) creates "I/O Wait" spikes that look like CPU load on the graphs but are really processes blocked on the disk.
To diagnose this, iostat is your weapon of choice. Install sysstat if you haven't already.
# Check extended disk statistics every 1 second
iostat -x 1
Metric to watch: await. This is the average time (in milliseconds) for I/O requests issued to the device to be served. If this exceeds 10ms on an SSD, you have a problem. On CoolVDS NVMe instances, we typically see this value below 1ms. If you are hosting a heavy MySQL workload, this difference is the gap between a 200ms page load and a 2s page load.
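The same latency signal is available in Grafana via node_exporter, which saves you from SSHing into boxes at 3 AM. A hedged PromQL sketch, assuming node_exporter 0.16+ metric names:

# Average write latency in seconds per completed write, per device
rate(node_disk_write_time_seconds_total[2m]) / rate(node_disk_writes_completed_total[2m])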
Pro Tip: When configuring AlertManager, set a trigger for predict_linear on disk space. It's better to be alerted that the disk will fill up in 4 hours than to be alerted when it is full.
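A minimal version of that idea as a Prometheus alerting rule, assuming node_exporter 0.16+ filesystem metrics (the label filter and windows are illustrative):

groups:
  - name: disk-capacity
    rules:
      - alert: DiskFullIn4Hours
        # Linear extrapolation of the last hour of free space, 4 hours ahead
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem on {{ $labels.instance }} predicted to fill within 4 hours"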
Nginx Metrics for the Paranoid
Don't just monitor the OS. You need to know what the web server is doing. Nginx has a stub_status module that is lightweight and essential.
Add this to your nginx.conf inside a server block restricted to localhost:
location /metrics {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
Then, use the nginx-prometheus-exporter sidecar to translate this into Prometheus format. This allows you to graph "Active Connections" vs. "Dropped Connections" in Grafana 6.0.
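A minimal invocation sketch, assuming the 2019-era nginxinc/nginx-prometheus-exporter binary and the /metrics stub_status location above; the exporter serves Prometheus-formatted metrics on its default port 9113 for Prometheus to scrape:

# Translate stub_status output into Prometheus metrics on :9113
./nginx-prometheus-exporter -nginx.scrape-uri http://127.0.0.1/metrics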
The Geographic Edge: Why Oslo Matters
Latency is a physical constraint. Speed of light in fiber is finite. If your dev team is in Oslo or Bergen, or your customers are Scandinavian, hosting in Frankfurt adds 15-20ms of round-trip time (RTT). Hosting in the US adds 100ms+.
When monitoring distributed systems, we use mtr to verify network paths. We peer directly at NIX (Norwegian Internet Exchange) to keep local traffic local.
# Summarize the path with 10 probe cycles per hop
mtr --report --report-cycles=10 193.x.x.x
If you see packet loss at the hops just before the destination, it is likely a DDoS mitigation filter kicking in or a saturated uplink at the provider edge. We monitor our uplinks for saturation 24/7/365, but you should verify your provider isn't throttling you during peak Netflix hours.
Compliance and Data Sovereignty
With GDPR fully enforceable since last year, where you store your monitoring logs matters. IP addresses in access logs are PII (Personally Identifiable Information). If you are shipping your logs to a US-based SaaS monitoring platform, you are navigating a legal minefield regarding data export.
Self-hosting your Prometheus and ELK stack on CoolVDS servers in Norway satisfies the data residency requirements referenced by Datatilsynet. You keep the data on your encrypted NVMe partitions, under your control, within the EEA.
Final Thoughts
Monitoring is not a "set and forget" task. It requires an architecture that supports high-frequency writes and low-latency checks. If your current VPS struggles to handle the I/O of the monitoring tool itself, it definitely can't handle your production workload.
Stop guessing why your server is slow. Deploy a Prometheus stack on a CoolVDS NVMe instance today and see what your infrastructure is actually doing.