The Lie of "99.9% Uptime"
If I have to wake up at 3:00 AM because PagerDuty is screaming about high latency on a database node, and I log in to find the CPU usage at 10% but the load average at 45, I know exactly what is happening. It's not my code. It's the "noisy neighbor" on the physical host stealing CPU cycles or saturating the storage controller. This is the reality of cheap VPS hosting in 2019.
Most providers give you a sanitized dashboard. They show you averaged-out graphs that hide micro-bursts and I/O wait times. If you are running critical infrastructure in Norway, whether for a high-traffic Magento store or a FinTech backend, you cannot rely on the hypervisor to tell you the truth. You need to scrape it yourself, from the kernel up.
We are going to build a monitoring stack that actually works, using Prometheus 2.11 and Grafana 6. This isn't just about pretty graphs; it's about survival. And it starts with choosing infrastructure that doesn't fight you. This is why we rely on CoolVDS KVM instances; when I run top, I want to see real hardware behavior, not a containerized simulation.
The Stack: Prometheus + Node Exporter
Forget Nagios. In 2019, if you aren't using time-series data, you aren't monitoring; you're just checking heartbeats. We need granular metrics. The standard for this today is the Prometheus ecosystem.
First, we need to expose the kernel metrics. We use the node_exporter binary. Do not install this via `apt` or `yum` because the repo versions are often ancient. Grab the binary directly.
1. Deploying the Exporter
On your target CoolVDS instance (running CentOS 7 or Debian 9), download the latest release:
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz
Extract it, move the binary to /usr/local/bin, and give the service its own unprivileged user so it never runs as root; a sketch of those steps is below. Then a systemd unit ensures it survives reboots. Reliability is not optional.
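A minimal sketch of those steps, assuming the v0.18.1 tarball from above (the useradd flags and file paths are my conventions; adjust to taste):

tar xzf node_exporter-0.18.1.linux-amd64.tar.gz
sudo cp node_exporter-0.18.1.linux-amd64/node_exporter /usr/local/bin/
# Unprivileged system account for the service; /bin/false works on both CentOS 7 and Debian 9
sudo useradd -r --no-create-home --shell /bin/false node_exporter

With that in place, drop the following unit into /etc/systemd/system/node_exporter.service: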
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.systemd --collector.processes
[Install]
WantedBy=multi-user.target
Reload the daemon, enable the service at boot, and start it:
systemctl daemon-reload && systemctl enable node_exporter && systemctl start node_exporter
You can verify it's working by curling the metrics endpoint locally. You should see raw data immediately:
curl localhost:9100/metrics | grep "node_load"
If you see output like node_load1 0.45, you are live. This bypasses any provider-side dashboard trickery. You are now reading directly from /proc.
2. Configuring the Scraper (Prometheus)
Now you need a central server to scrape these metrics. I recommend setting up a dedicated CoolVDS instance for this. Monitoring must live outside the failure domain of the application it monitors.
Edit your prometheus.yml. Here is a configuration tuned for a 15-second scrape interval. We want high resolution to catch those micro-spikes that affect NVMe latency.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds_nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    basic_auth:
      username: 'admin'
      password: 'REDACTED_SECURE_PASSWORD'
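Before restarting, let promtool (it ships alongside Prometheus 2.11) confirm the file actually parses. The config path and service name here are my assumptions; adjust them to however you installed Prometheus:

promtool check config /etc/prometheus/prometheus.yml
sudo systemctl restart prometheus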
Pro Tip: Never expose port 9100 to the public internet. Use a VPN tunnel or restrict access via iptables to allow only your Prometheus server IP. Security is part of stability.
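A minimal iptables sketch, assuming your Prometheus server sits at 10.0.0.2 (a placeholder; substitute your real scraper IP) and nothing else is managing the firewall on the node:

iptables -A INPUT -p tcp --dport 9100 -s 10.0.0.2 -j ACCEPT
iptables -A INPUT -p tcp --dport 9100 -j DROP
# Persist the rules with your distro's mechanism (iptables-save on Debian, iptables-services on CentOS 7)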
Visualizing the "Steal"
This is where the rubber meets the road. In Grafana, import dashboard ID 1860 (Node Exporter Full). It is the gold standard in 2019.
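If you prefer configuration files over clicking through the UI, Grafana can pick up the Prometheus data source at startup from a provisioning file. A sketch, assuming Grafana runs on the same monitoring instance and uses the stock provisioning directory:

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true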
Pay close attention to the CPU Steal metric. This is the silent killer.
| Metric | What it means | The CoolVDS Difference |
|---|---|---|
| iowait | CPU is waiting for disk. | On our NVMe arrays, this should stay near 0%. High iowait on SSDs usually means the host is oversold. |
| steal | Hypervisor is serving other VMs. | If this crosses 1%, your provider is overloading the physical core. We cap allocation to prevent this. |
| load15 | 15-min average load. | Sustained high load without high CPU usage indicates a bottleneck in I/O or memory bandwidth. |
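You don't have to wait for a panel to surface this. In the Prometheus console or Grafana's Explore tab, a query along these lines (metric and label names match node_exporter 0.16+) shows steal and iowait as a percentage per instance:

avg by (instance, mode) (rate(node_cpu_seconds_total{mode=~"steal|iowait"}[5m])) * 100

Anything that sits above 1% steal for more than a few minutes means the physical core is contended.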
The Nordic Context: Latency and Compliance
Why bother hosting this in Norway? Latency and law. If your users are in Oslo or Bergen, routing traffic through Frankfurt adds 20-30ms of round-trip time. In the world of high-frequency trading or real-time gaming, that is an eternity.
By using CoolVDS, your packets hit the NIX (Norwegian Internet Exchange) almost immediately. Check your ping times:
ping -c 5 nix.no
Furthermore, with GDPR fully enforced since last year, data residency is critical. Storing logs and metrics, which often contain IP addresses (PII), on servers physically located in Norway simplifies your compliance posture with Datatilsynet.
Alerting Before the Crash
Graphs look cool, but alerts wake you up. We configure Alertmanager to ping us on Slack only when it matters. No one reads emails.
Here is a rule to detect if your disk fill rate predicts 100% usage within 4 hours. This uses Prometheus's linear prediction function, predict_linear.
groups:
  - name: storage_alerts
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{job="coolvds_nodes"}[1h], 4 * 3600) < 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Disk is filling up fast on {{ $labels.instance }}"
This is predictive maintenance. You fix the issue at 2 PM, not 2 AM.
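On the Alertmanager side, the Slack wiring is a short receiver block. A sketch, assuming an incoming-webhook URL and an #ops-alerts channel (both are placeholders; use your own):

route:
  receiver: 'slack-ops'
receivers:
  - name: 'slack-ops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME/TOKEN'
        channel: '#ops-alerts'
        send_resolved: true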
Why Infrastructure Choice Dictates Monitoring Accuracy
You cannot monitor what you cannot see. On shared hosting or container-based VPS (OpenVZ), you are often looking at the host's kernel metrics, not your own isolated environment. This leads to false positives.
At CoolVDS, we use KVM (Kernel-based Virtual Machine). When you run uname -r, that is your kernel. When you check /proc/meminfo, that is your RAM. This isolation means your monitoring data is accurate, actionable, and legally defensible. Don't let a budget host ruin your uptime stats.
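A quick way to check what you are actually running on is systemd-detect-virt, which ships with systemd on both CentOS 7 and Debian 9:

systemd-detect-virt
# kvm          -> full virtualization, your own kernel
# openvz / lxc -> container, you are reading the host's kernel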
If you are serious about performance, stop guessing. Spin up a KVM instance, install the exporter, and look at the raw numbers. The difference is usually shocking.
Ready to see the truth? Deploy a high-performance NVMe instance on CoolVDS today and get full root access in under 55 seconds.