Escaping the Nagios Trap: Modern Infrastructure Monitoring with Prometheus 2.0 on High-Performance KVM

It is 3:14 AM. Your phone buzzes. It’s Nagios. Again. It says "Disk Space Critical" on a staging server you haven't touched in three months. You acknowledge the alert, go back to sleep, and wake up two hours later to find your primary database locked up because an unmonitored IOPS spike choked the replication thread. If this sounds familiar, your monitoring strategy is stuck in 2010.

We are halfway through 2018. With the recent release of Prometheus 2.0 and Grafana 5, the barrier to entry for time-series monitoring has collapsed. Yet, I still see sysadmins in Oslo writing Bash scripts to email them when `loadavg` hits 5.0. That is not monitoring; that is anxiety automation.

In this guide, we are going to build a monitoring stack that actually provides insight rather than noise. We will focus on the Prometheus 2.0 ecosystem, deployed on CoolVDS KVM instances, while respecting the new GDPR rules that Datatilsynet (the Norwegian Data Protection Authority) has been enforcing strictly since May 25th.

The "White Noise" Problem in Traditional Hosting

The biggest lie in the VPS industry is "guaranteed resources" on container-based virtualization (like OpenVZ). When your neighbor decides to mine cryptocurrency or compile a kernel, your metrics go haywire. Your CPU steal time spikes, but your internal monitoring shows normal usage. You are chasing ghosts.

Pro Tip: Always monitor node_cpu_seconds_total{mode="steal"}. If this metric is consistently above 1-2% on your current provider, you are paying for performance you aren't getting. This is why we exclusively use KVM at CoolVDS—hardware isolation means your metrics reflect your workload, not your neighbor's.
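A quick way to eyeball steal is a PromQL expression like the following (a minimal sketch; the 5-minute window and per-instance averaging are my own choices, adjust to taste):

# Percentage of CPU time stolen by the hypervisor, averaged per instance
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100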

Step 1: The Exporter Strategy (Stop Using SNMP)

Forget SNMP. It’s heavy, insecure, and painful to debug. In the Prometheus ecosystem, we use Exporters. These are lightweight binaries that expose metrics over HTTP. The standard for Linux boxes is the node_exporter.

Here is how to deploy it cleanly as a systemd service (don't run it in a screen session, please):

# Create a dedicated user for security
useradd --no-create-home --shell /bin/false node_exporter

# Download the binary (Version 0.16.0 is current as of May 2018)
wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
tar xvf node_exporter-0.16.0.linux-amd64.tar.gz
cp node_exporter-0.16.0.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

Now, create the systemd unit. We want to enable the systemd collector so we can track service restarts.

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter --collector.systemd

[Install]
WantedBy=multi-user.target

Reload the daemon, then enable and start the service: systemctl daemon-reload && systemctl enable node_exporter && systemctl start node_exporter.
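Before moving on, a quick sanity check that the exporter is actually serving metrics (assuming the default port 9100):

# Confirm node_exporter is up and exposing CPU metrics
curl -s http://localhost:9100/metrics | grep -m 5 '^node_cpu_seconds_total'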

Step 2: Configuring Prometheus 2.0

Prometheus 2.0 brought a massive performance improvement to the TSDB (Time Series Database). It now uses an immutable block structure that drastically reduces I/O pressure—critical if you are running your monitoring stack on shared storage (though on CoolVDS NVMe storage, this is less of a concern).

Here is a battle-tested prometheus.yml configuration. Note the scrape interval. Many tutorials say 15 seconds. In a high-traffic environment, a lot can happen in 15 seconds. If you have the I/O bandwidth, go lower.

# /etc/prometheus/prometheus.yml
global:
  scrape_interval:     10s
  evaluation_interval: 10s

scrape_configs:
  - job_name: 'coolvds-nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    
    # Prometheus sets the "instance" label automatically; an explicit environment label makes filtering easier
    relabel_configs:
      - source_labels: [__address__]
        regex: '10\.0\.0\.5:9100'
        target_label: environment
        replacement: production
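Before restarting Prometheus, validate the file. The promtool binary ships in the same release tarball as Prometheus 2.0 (the /usr/local/bin path here assumes you copied it there alongside prometheus):

# Validate the configuration before reloading Prometheus
promtool check config /etc/prometheus/prometheus.yml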

The Query That Matters: Predicting Saturation

Alerting on "Disk Space > 90%" is useless if the disk fills up in 10 minutes. You need to alert on the rate of growth. This PromQL query calculates how many hours you have left until the disk is full:

predict_linear(node_filesystem_free_bytes{job="coolvds-nodes"}[1h], 4 * 3600) < 0

This alerts you if the disk is projected to fill up within the next 4 hours based on the last hour of data. This gives you time to react, rather than time to panic.
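To turn that query into an actual page, wrap it in a Prometheus 2.0 rule file and route it through Alertmanager. A minimal sketch (the file path, group name, 10-minute hold, and labels are my assumptions; reference the file via rule_files in prometheus.yml):

# /etc/prometheus/rules/disk.yml
groups:
  - name: disk-capacity
    rules:
      - alert: DiskFullIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{job="coolvds-nodes"}[1h], 4 * 3600) < 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} is projected to fill within 4 hours"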

Step 3: Visualization with Grafana 5

Grafana 5 was released just a few months ago (March 2018), introducing the new dashboard layout engine. It is smoother and handles mobile views better, which is helpful when checking status on the Flytoget train to Gardermoen.

When connecting Prometheus to Grafana, ensure connections are kept alive and reused rather than opened per query. We have seen latency spikes in monitoring dashboards simply because TCP and TLS handshake overhead piled up across hundreds of endpoints.
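Grafana 5 also added data source provisioning, so you can define the Prometheus connection in a file instead of clicking through the UI. A minimal sketch, assuming Prometheus runs locally on its default port and you want proxy access mode:

# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true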

The GDPR & Data Residency Angle

We cannot ignore the elephant in the room. As of last month (May 2018), GDPR is enforceable. Datatilsynet is not to be trifled with.

If you are pushing your server metrics (which often contain IP addresses, user IDs in logs, or process names) to a US-based SaaS monitoring platform, you are stepping into a legal minefield regarding data export.

Hosting your own Prometheus stack on CoolVDS in our Oslo data center solves this immediately:

  • Data Sovereignty: Your metrics never leave Norway.
  • Latency: Scrape latency affects the resolution of your data. Pushing metrics across the Atlantic adds 100 ms or more of round-trip latency plus jitter. Scraping locally within the NIX (Norwegian Internet Exchange) infrastructure keeps it under 2 ms.

Why Infrastructure Choice Dictates Monitoring Success

I once consulted for a media firm trying to debug intermittent SQL timeouts. Their monitoring showed CPU usage at 40%. They were baffled.

We dug deeper and looked at node_disk_io_time_seconds_total. The graph looked like a sawtooth wave. They were on a cheap "Cloud VPS" provider that throttled IOPS heavily once a burst bucket was exhausted. The CPU was fine; the processes were just in D-state (Uninterruptible Sleep) waiting for the disk.
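If you want to catch this pattern yourself, the rate of that counter approximates disk utilization, where 1.0 means the device was busy 100% of the time. A sketch, with the device filter and 5-minute window as assumptions:

# Fraction of time the disk was busy over the last 5 minutes (1 = saturated)
rate(node_disk_io_time_seconds_total{device!~"^(dm-|loop).*"}[5m])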

Metric                 Standard HDD VPS         CoolVDS NVMe KVM
Random Read IOPS       ~150 - 300               10,000+
I/O Wait (Average)     High (variable)          Near Zero
Metric Accuracy        Low (Noisy Neighbors)    High (KVM Isolation)

We migrated them to a CoolVDS instance with NVMe storage. The I/O wait vanished. The SQL timeouts stopped. The monitoring charts finally made sense.

Conclusion

In 2018, monitoring is not just about checking if the server is up. It is about understanding the granular behavior of your applications and the hardware underneath them. You need tools like Prometheus 2.0 that can handle high cardinality, and you need infrastructure that doesn't lie to you.

Don't let slow I/O or noisy neighbors kill your uptime. Deploy a high-performance KVM instance with local NVMe storage today.

Deploy your Prometheus stack on CoolVDS – instances boot in as little as 55 seconds.