The Four Golden Signals: Architecting a Resilient Monitoring Stack on Norwegian Soil

It was 3:42 AM on a Tuesday when my phone buzzed. Not a call, but a PagerDuty alert. The frontend was throwing 502 Bad Gateways. By the time I logged into the server, the CPU load was normal, RAM was fine, and the disk had space. Yet, the service was dead.

The culprit? Entropy starvation and an exhausted connection pool. Standard "green light" dashboards missed it completely because they were looking at averages, not spikes. If you are relying on your hosting provider's default usage graphs to judge system health, you are flying blind.

In the world of high-availability systems, specifically here in the Nordics where customers expect banking-grade reliability, passive monitoring is negligence. Today, we are going to build a monitoring stack that actually works using Prometheus and Grafana, focusing on the "Four Golden Signals" defined by Google SREs: Latency, Traffic, Errors, and Saturation. And we are going to do it with strict adherence to GDPR, keeping our data right here in Norway.

Why Your Current Monitoring Lies to You

Most VPS providers give you a nice graph showing 5-minute averages. A 5-minute average is a lie. A CPU spike that lasts 20 seconds and locks up your database write-ahead log (WAL) will vanish into a 5-minute average. You need resolution. You need 15-second scrape intervals.
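
To see why, compare the smoothed view with the worst value in the same window using node_exporter's load metric (a minimal PromQL illustration, not a dashboard recommendation):

# The flattering 5-minute average
avg_over_time(node_load1[5m])

# The spike that actually woke you up
max_over_time(node_load1[5m])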

However, high-resolution monitoring creates a massive I/O penalty. Time Series Databases (TSDBs) like Prometheus generate thousands of small write operations per second. On standard SATA SSDs (or worse, spinning rust), your monitoring stack will become the bottleneck. This is why I run my observability stacks on CoolVDS NVMe instances. The I/O throughput isn't just a luxury; for TSDBs, it is a requirement.
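
You can watch that write pressure yourself while Prometheus is running (assuming the sysstat package is installed and your TSDB lives on the device shown by df):

# Extended per-device stats every 5 seconds; watch w/s (writes per second) and %util
iostat -x 5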

Step 1: The Foundation (Prometheus + Node Exporter)

We aren't using SaaS solutions like Datadog here. Why? Data sovereignty. With the European Court of Justice scrutinizing data transfers (the case C-311/18 is looming over us), relying on US-based aggregators is a compliance risk I'm not willing to take for sensitive system logs. We keep it self-hosted, we keep it local.

Let's set up the node_exporter to expose hardware metrics. This binary needs to run on every target machine.

# Create a user for the exporter
useradd --no-create-home --shell /bin/false node_exporter

# Download node_exporter 1.0.0-rc.0
wget https://github.com/prometheus/node_exporter/releases/download/v1.0.0-rc.0/node_exporter-1.0.0-rc.0.linux-amd64.tar.gz

# Extract and run
tar xvf node_exporter-1.0.0-rc.0.linux-amd64.tar.gz
cp node_exporter-1.0.0-rc.0.linux-amd64/node_exporter /usr/local/bin/node_exporter

# Create systemd service
cat <<EOF > /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload && systemctl enable --now node_exporter
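
Once the service is up, the exporter answers on its default port, 9100 (the same port the relabel rule in the next step assumes):

# Quick sanity check: you should see a wall of metrics, including node_cpu_seconds_total
curl -s http://localhost:9100/metrics | head -n 20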

Step 2: Configuring Prometheus for Aggressive Scraping

Now we configure the Prometheus server itself. This is where storage speed matters: at a scrape_interval of 15s (or tighter), Prometheus is constantly appending samples to its write-ahead log and compacting blocks on disk.

Here is a production-ready prometheus.yml. Note the use of file_sd_configs. In a dynamic environment (like auto-scaling groups), hardcoding IPs is a recipe for failure. We use file-based service discovery to dynamically update targets without restarting Prometheus.

global:
  scrape_interval: 15s 
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node_exporter_metrics'
    file_sd_configs:
      - files:
        - 'targets/*.json'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: instance
        replacement: '$1'
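
The targets themselves live in small JSON files that Prometheus re-reads on change, no restart required. A minimal targets/nodes.json might look like this (the addresses and labels are placeholders):

[
  {
    "targets": ["10.10.0.11:9100", "10.10.0.12:9100"],
    "labels": {
      "env": "production",
      "datacenter": "oslo"
    }
  }
]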

Pro Tip: Never expose your Prometheus dashboard (port 9090) to the public internet. Use an Nginx reverse proxy with basic auth, or better yet, tunnel it through a VPN. Security by obscurity is not a strategy.
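
A minimal sketch of that reverse proxy, assuming Prometheus listens only on 127.0.0.1:9090 and an htpasswd file already exists at /etc/nginx/.htpasswd (the hostname and certificate paths are placeholders):

server {
    listen 443 ssl;
    server_name prometheus.example.com;

    ssl_certificate     /etc/letsencrypt/live/prometheus.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/prometheus.example.com/privkey.pem;

    location / {
        auth_basic           "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:9090;
        proxy_set_header     Host $host;
    }
}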

Step 3: Visualizing with Grafana 7.0

Grafana v7.0 just dropped (May 2020), and the new tracing UI is fantastic. But we are here for the dashboards. Connect Grafana to your local Prometheus datasource.
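
You can add the datasource through the UI, or provision it as code so it survives rebuilds. A minimal provisioning file (the path and datasource name follow Grafana's conventions; I'm assuming Grafana and Prometheus share a host):

# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true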

To really test your underlying infrastructure performance, we need to look at I/O Wait. This metric tells you if your CPU is sitting idle waiting for the disk to read data. High I/O wait is the silent killer of database performance.

Use this PromQL query to visualize I/O wait on your CoolVDS instance:

avg(irate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) * 100

If this number goes above 5% consistently, your storage is too slow for your workload. This is common on budget "cloud" VPS providers that oversell their storage arrays.
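
Codified as a Prometheus alerting rule, that threshold might look like this (the 10-minute hold and the severity label are my own choices, tune them to your environment):

groups:
  - name: storage_alerts
    rules:
      - alert: HighIOWait
        expr: avg(irate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) * 100 > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "I/O wait above 5% on {{ $labels.instance }}"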

The "Norway Factor": Latency and Compliance

Why host this in Norway? Aside from the obvious Datatilsynet requirements regarding personal data (and yes, IP addresses in logs are personal data), there is the physics of latency. If your users are in Oslo, Bergen, or Trondheim, routing your monitoring alerts through a data center in Frankfurt adds milliseconds. In high-frequency trading or real-time bidding, those milliseconds cost money.

Running a traceroute from a CoolVDS instance in Oslo to the NIX (Norwegian Internet Exchange) usually shows sub-2ms latency. Compare that to 35ms+ for US East.

Comparison: Storage Tech for Monitoring

Storage Type  | IOPS (Approx)  | Suitability for TSDB
Standard HDD  | 80-120         | Unusable for production Prometheus
SATA SSD      | 5,000-10,000   | Acceptable for small clusters
CoolVDS NVMe  | 20,000+        | Ideal for high-resolution scraping
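
Don't take anyone's marketing table on faith; measure your own volume. A rough 4K random-write test with fio (the file path and size are arbitrary, and the test writes real data, so point it at scratch space):

# Random 4K writes with direct I/O for 60 seconds; the IOPS figure in the summary is the number that matters
fio --name=tsdb-sim --filename=/var/tmp/fio-test --rw=randwrite --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=32 --size=1G --runtime=60 \
    --time_based --group_reporting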

Detecting Saturation with Blackbox Exporter

Internal metrics aren't enough. You need to know what the user sees. The blackbox_exporter allows Prometheus to probe endpoints over HTTP, DNS, TCP, and ICMP.

Here is how to configure a probe module that checks your API returns a healthy response with the expected body; the latency threshold itself is enforced with an alerting rule further down:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200, 201, 204]
      fail_if_body_not_matches_regexp:
        - '"status": "ok"'
      method: GET
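
The module does nothing until Prometheus drives it. The usual pattern is a scrape job, added under scrape_configs in the prometheus.yml from Step 2, that rewrites each target into a query parameter for the exporter (I'm assuming blackbox_exporter runs next to Prometheus on its default port 9115, and the /health URL is a placeholder):

  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115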

Combine this with an alerting rule routed through Alertmanager: if probe latency stays above 0.5s for more than a minute, fire an alert.
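
A minimal version of that rule, using the probe_duration_seconds and probe_success metrics the probes produce (the job label matches the scrape job above; severity values are whatever your Alertmanager routing expects):

groups:
  - name: blackbox_alerts
    rules:
      - alert: EndpointSlow
        expr: probe_duration_seconds{job="blackbox_http"} > 0.5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is taking longer than 500ms to answer"
      - alert: EndpointDown
        expr: probe_success{job="blackbox_http"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is failing its blackbox probe"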

Conclusion

Monitoring is not just about pretty graphs; it is about sleep hygiene. A properly tuned stack tells you about a problem before the customers start tweeting about it. But remember, software is only as fast as the hardware it runs on. A heavy Prometheus stack will choke a standard VPS.

If you are serious about observability, you need infrastructure that respects the I/O demands of modern TSDBs. Don't let disk latency be the reason you miss a critical alert.

Ready to build a monitoring stack that doesn't falter under load? Deploy a high-performance NVMe instance on CoolVDS today and see the difference raw I/O power makes.