The Lie of "99.9% Uptime"
If I have to wake up at 3:00 AM because PagerDuty is screaming about high latency on a database node, and I log in to find the CPU usage at 10% but the load average at 45, I know exactly what is happening. It's not my code. It's the "noisy neighbor" on the physical host stealing CPU cycles or saturating the storage controller. This is the reality of cheap VPS hosting in 2019.
Most providers give you a sanitized dashboard. They show you averaged-out graphs that hide micro-bursts and I/O wait times. If you are running critical infrastructure in Norway, whether for a high-traffic Magento store or a FinTech backend, you cannot rely on the hypervisor to tell you the truth. You need to scrape it yourself, from the kernel up.
We are going to build a monitoring stack that actually works, using Prometheus 2.11 and Grafana 6. This isn't just about pretty graphs; it's about survival. And it starts with choosing infrastructure that doesn't fight you. This is why we rely on CoolVDS KVM instances; when I run top, I want to see real hardware behavior, not a containerized simulation.
The Stack: Prometheus + Node Exporter
Forget Nagios. In 2019, if you aren't using time-series data, you aren't monitoring; you're just checking heartbeats. We need granular metrics. The standard for this today is the Prometheus ecosystem.
First, we need to expose the kernel metrics. We use the node_exporter binary. Do not install this via `apt` or `yum` because the repo versions are often ancient. Grab the binary directly.
1. Deploying the Exporter
On your target CoolVDS instance (running CentOS 7 or Debian 9), download the latest release:
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
Extract it, move the binary to /usr/local/bin, and give the service its own unprivileged user so it never runs as root; a sketch of those steps is below. Then a systemd unit ensures it survives reboots. Reliability is not optional.
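A minimal sketch of those steps, assuming the v0.18.1 tarball from above (the useradd flags and file paths are my conventions; adjust to taste):

tar xzf node_exporter-0.18.1.linux-amd64.tar.gz
sudo cp node_exporter-0.18.1.linux-amd64/node_exporter /usr/local/bin/
# Unprivileged system account for the service; /bin/false works on both CentOS 7 and Debian 9
sudo useradd -r --no-create-home --shell /bin/false node_exporter

With that in place, drop the following unit into /etc/systemd/system/node_exporter.service: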
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.systemd --collector.processes
[Install]
WantedBy=multi-user.target
Reload the daemon, enable the service at boot, and start it:
systemctl daemon-reload && systemctl enable node_exporter && systemctl start node_exporter
You can verify it's working by curling the metrics endpoint locally. You should see raw data immediately:
curl localhost:9100/metrics | grep "node_load"
If you see output like node_load1 0.45, you are live. This bypasses any provider-side dashboard trickery. You are now reading directly from /proc.
2. Configuring the Scraper (Prometheus)
Now you need a central server to scrape these metrics. I recommend setting up a dedicated CoolVDS instance for this. Monitoring must live outside the failure domain of the application it monitors.
Edit your prometheus.yml. Here is a configuration tuned for a 15-second scrape interval. We want high resolution to catch those micro-spikes that affect NVMe latency.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds_nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    basic_auth:
      username: 'admin'
      password: 'REDACTED_SECURE_PASSWORD'
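Before restarting, let promtool (it ships alongside Prometheus 2.11) confirm the file actually parses. The config path and service name here are my assumptions; adjust them to however you installed Prometheus:

promtool check config /etc/prometheus/prometheus.yml
sudo systemctl restart prometheus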
Pro Tip: Never expose port 9100 to the public internet. Use a VPN tunnel or restrict access via iptables to allow only your Prometheus server IP. Security is part of stability.
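A minimal iptables sketch, assuming your Prometheus server sits at 10.0.0.2 (a placeholder; substitute your real scraper IP) and nothing else is managing the firewall on the node:

iptables -A INPUT -p tcp --dport 9100 -s 10.0.0.2 -j ACCEPT
iptables -A INPUT -p tcp --dport 9100 -j DROP
# Persist the rules with your distro's mechanism (iptables-save on Debian, iptables-services on CentOS 7)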
Visualizing the "Steal"
This is where the rubber meets the road. In Grafana, import dashboard ID 1860 (Node Exporter Full). It is the gold standard in 2019.
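If you prefer configuration files over clicking through the UI, Grafana can pick up the Prometheus data source at startup from a provisioning file. A sketch, assuming Grafana runs on the same monitoring instance and uses the stock provisioning directory:

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true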
Pay close attention to the CPU Steal metric. This is the silent killer.
| Metric | What it means | The CoolVDS Difference |
|---|---|---|
| iowait | CPU is waiting for disk. | On our NVMe arrays, this should stay near 0%. High iowait on SSDs usually means the host is oversold. |
| steal | Hypervisor is serving other VMs. | If this crosses 1%, your provider is overloading the physical core. We cap allocation to prevent this. |
| load15 | 15-min average load. | Sustained high load without high CPU usage indicates a bottleneck in I/O or memory bandwidth. |
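You don't have to wait for a panel to surface this. In the Prometheus console or Grafana's Explore tab, a query along these lines (metric and label names match node_exporter 0.16+) shows steal and iowait as a percentage per instance:

avg by (instance, mode) (rate(node_cpu_seconds_total{mode=~"steal|iowait"}[5m])) * 100

Anything that sits above 1% steal for more than a few minutes means the physical core is contended.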
The Nordic Context: Latency and Compliance
Why bother hosting this in Norway? Latency and law. If your users are in Oslo or Bergen, routing traffic through Frankfurt adds 20-30ms of round-trip time. In the world of high-frequency trading or real-time gaming, that is an eternity.
By using CoolVDS, your packets hit the NIX (Norwegian Internet Exchange) almost immediately. Check your ping times:
ping -c 5 nix.no
Furthermore, with GDPR fully enforced since last year, data residency is critical. Storing logs and metrics, which often contain IP addresses (PII), on servers physically located in Norway simplifies your compliance posture with Datatilsynet.
Alerting Before the Crash
Graphs look cool, but alerts wake you up. We configure Alertmanager to ping us on Slack only when it matters. No one reads emails.
Here is a rule to detect if your disk fill rate predicts 100% usage within 4 hours. This uses Prometheus's linear prediction function, predict_linear.
groups:
  - name: storage_alerts
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{job="coolvds_nodes"}[1h], 4 * 3600) < 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Disk is filling up fast on {{ $labels.instance }}"
This is predictive maintenance. You fix the issue at 2 PM, not 2 AM.
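On the Alertmanager side, the Slack wiring is a short receiver block. A sketch, assuming an incoming-webhook URL and an #ops-alerts channel (both are placeholders; use your own):

route:
  receiver: 'slack-ops'
receivers:
  - name: 'slack-ops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME/TOKEN'
        channel: '#ops-alerts'
        send_resolved: true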
Why Infrastructure Choice Dictates Monitoring Accuracy
You cannot monitor what you cannot see. On shared hosting or container-based VPS (OpenVZ), you are often looking at the host's kernel metrics, not your own isolated environment. This leads to false positives.
At CoolVDS, we use KVM (Kernel-based Virtual Machine). When you run uname -r, that is your kernel. When you check /proc/meminfo, that is your RAM. This isolation means your monitoring data is accurate, actionable, and legally defensible. Don't let a budget host ruin your uptime stats.
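A quick way to check what you are actually running on is systemd-detect-virt, which ships with systemd on both CentOS 7 and Debian 9:

systemd-detect-virt
# kvm          -> full virtualization, your own kernel
# openvz / lxc -> container, you are reading the host's kernel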
If you are serious about performance, stop guessing. Spin up a KVM instance, install the exporter, and look at the raw numbers. The difference is usually shocking.
Ready to see the truth? Deploy a high-performance NVMe instance on CoolVDS today and get full root access in under 55 seconds.