Sleep Through the Night: Architecting Bulletproof Infrastructure Monitoring at Scale

The sound of PagerDuty triggering at 03:14 is a trauma specific to systems administrators. It's never a convenient time. Last month, I watched a clustered database deployment implode because we were monitoring uptime (ICMP ping) but ignoring disk I/O latency. The server was "up," but I/O wait was so high that application requests were effectively timing out. The dashboard was green. The customers were furious.

Monitoring at scale isn't about collecting more data. It's about collecting the right data and filtering out the noise. If everything is an alert, nothing is.

In the Norwegian hosting landscape, where data residency (GDPR) and latency to the NIX (Norwegian Internet Exchange) are critical, you cannot rely on slow, external US-based SaaS monitoring tools that lag by minutes. You need to own your metrics. Here is how we build a monitoring stack that actually works using Prometheus and Grafana on CoolVDS KVM instances.

The Stack: Why Pull Beats Push (Mostly)

In 2020, the debate is largely settled. For infrastructure metrics, Prometheus (the pull model) is superior to the old Push approaches (like Graphite or Zabbix agents in active mode) for dynamic environments. Why? Because your monitoring system knows what should be there. If a node doesn't answer, Prometheus knows immediately. If a push agent dies, silence is often interpreted as "no news is good news" until it's too late.
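
To make that concrete: Prometheus synthesizes an `up` metric for every scrape target, set to 1 on a successful scrape and 0 on a failure, so a dead node surfaces within a single scrape interval. A minimal alerting rule sketch is below; the group name, the 2-minute `for` clause, and the severity label are our own conventions, not anything Prometheus mandates.

groups:
  - name: target-health
    rules:
      - alert: InstanceDown
        expr: up == 0              # scrape failed or target disappeared
        for: 2m                    # tolerate a single flaky scrape
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has stopped answering scrapes"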

Step 1: The Foundation (Node Exporter)

Don't overcomplicate this. We need raw metrics from the kernel, so we run node_exporter on every target instance (Ubuntu 18.04 LTS across our fleet).

Here is the battle-tested systemd service file we use. Note the flag to disable collectors you don't need—this saves CPU cycles on your busy frontend servers.

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.disable-defaults \
    --collector.cpu \
    --collector.meminfo \
    --collector.filesystem \
    --collector.netdev \
    --collector.loadavg \
    --collector.diskstats

[Install]
WantedBy=multi-user.target

This configuration strips away the bloat. We only care about CPU, memory, filesystem, network, load, and disk stats. Everything else is noise.
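
If you are wiring this up from scratch, the surrounding steps look roughly like the following. This is a sketch that assumes the node_exporter binary is already at /usr/local/bin/node_exporter and the unit above is saved as /etc/systemd/system/node_exporter.service.

# create an unprivileged service account for the exporter
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter

# register and start the unit
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# sanity check: the metrics endpoint should answer on port 9100
curl -s http://localhost:9100/metrics | head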

The Storage Bottleneck: Why NVMe Matters

Here is the uncomfortable truth about Prometheus: It eats IOPS for breakfast.

Prometheus uses a custom Time Series Database (TSDB). When you are ingesting 50,000 samples per second from a fleet of servers, the write pattern to the disk is intense. On traditional spinning rust (HDD) or cheap SATA SSD VPS providers, your monitoring server will choke. You will see gaps in your graphs. These gaps are where the truth was hiding.

Pro Tip: We migrated our internal monitoring stack to CoolVDS specifically for the NVMe storage backend. When compacting TSDB blocks, the random read/write speeds of NVMe prevent the monitoring server from stalling. If your monitoring tool is slower than the infrastructure it watches, you are flying blind.
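
On the Prometheus server itself, point the TSDB at the NVMe-backed volume and set retention explicitly instead of relying on defaults. The sketch below shows the relevant launch flags; the paths and the 30-day retention window are assumptions to adapt to your own disk budget.

/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=30d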

Step 2: Configuring Prometheus for Scrape Efficiency

In your prometheus.yml, do not keep the 15-second scrape interval from the example config unless you actually need that granularity; it generates massive data volumes. For standard infrastructure, 30 seconds is the sweet spot between resolution and retention.

global:
  scrape_interval:     30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'coolvds_nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: instance
        replacement: '${1}'
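
Before reloading Prometheus, validate the file with promtool (it ships alongside the Prometheus binary) so a stray indentation or relabel typo doesn't take scraping down. Prometheus also re-reads its configuration on SIGHUP, which avoids a full restart.

promtool check config /etc/prometheus/prometheus.yml
kill -HUP $(pidof prometheus)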

Predictive Alerting: The Killer PromQL

Stop alerting when the disk is 95% full. If you have a 2TB drive, 5% is 100GB—plenty of space. If you have a 10GB container, 5% is 500MB—panic time.

Instead, alert on time until full. This requires linear prediction. This is the exact query we use in Grafana 6.7 to wake us up only if the disk will fill up in the next 4 hours:

predict_linear(node_filesystem_free_bytes{job="coolvds_nodes"}[1h], 4 * 3600) < 0

This query looks at the trend over the last hour ([1h]) and projects it 4 hours into the future. If the result is negative (i.e., less than zero bytes free), it triggers. This saves your sleep schedule.
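
If you want Alertmanager to page you on this rather than relying solely on a Grafana panel alert, a rule file sketch could look like the following. The alert name, the 30-minute `for` clause, and the severity label are our assumptions; the expression is the same one shown above.

groups:
  - name: disk-capacity
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{job="coolvds_nodes"}[1h], 4 * 3600) < 0
        for: 30m                   # ignore short bursts from log rotation or backups
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is trending towards full within 4 hours"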

Data Sovereignty and The "Datatilsynet" Factor

Working in Norway means respecting privacy. Even system logs can contain PII (IP addresses, usernames). Sending this data to a cloud monitoring SaaS hosted in US-East-1 is a liability, especially with the current uncertainty regarding cross-border data transfers.

By hosting your Prometheus and ELK (Elasticsearch, Logstash, Kibana) stack on a CoolVDS instance in Oslo, you ensure:

  1. GDPR Compliance: Data never leaves the EEA.
  2. Low Latency: Monitoring probes originating from Oslo reflect the experience of your Norwegian user base. A check from Virginia telling you your Oslo site is slow might just mean the trans-Atlantic link is congested. A check from Oslo telling you the site is slow means the server is actually dying.

Visualizing the Truth

For visualization, we pair Prometheus with Grafana. Below is a snapshot of how we structure our "Health at a Glance" table in Grafana. It uses the `node_load1` metric divided by the count of CPUs to give a normalized load percentage.

| Metric | PromQL Query | Threshold (Warning) |
| --- | --- | --- |
| Normalized Load | `node_load1 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)` | > 1.0 |
| I/O Wait | `rate(node_cpu_seconds_total{mode="iowait"}[5m])` | > 0.1 (10%) |
| Memory Saturation | `(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes` | > 0.9 (90%) |
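
One caveat on the I/O Wait row: `rate(node_cpu_seconds_total{mode="iowait"}[5m])` returns one series per CPU core, so a multi-core server produces several lines per instance. To compare against a single per-server threshold, collapse the cores with an average, for example:

avg without (cpu) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))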

Conclusion: Don't Skimp on the Watchman

Infrastructure monitoring is your insurance policy. You wouldn't put a cheap padlock on a bank vault. Similarly, don't put your monitoring stack on shared, oversold hosting where "noisy neighbors" steal your CPU cycles during a crisis.

You need dedicated resources, predictable performance, and NVMe throughput to handle time-series data ingestion at scale. CoolVDS provides the raw power required to run a Prometheus/Grafana stack that is faster than the problems it detects.

Next Step: Don't let a silent failure kill your reputation. Deploy a dedicated monitoring instance on CoolVDS today. Spin up takes less than 55 seconds.