The 3:00 AM Wake-Up Call You Could Have Avoided
It’s 03:14 AM on a Tuesday. Your phone vibrates off the nightstand. It’s PagerDuty. The alert isn't helpful: CRITICAL: Server Unreachable. By the time you ssh in—if you can even connect—the load average is 45.0, the logs are rotating so fast they're unreadable, and you have no historical data to see what triggered the avalanche. You restart services blindly, hoping it holds until morning.
We've all been there. But in 2022, "I don't know what happened" is not an acceptable RCA (Root Cause Analysis). If you are running infrastructure in Europe, particularly within the strict regulatory environment of Norway, you need total visibility. You need to know a disk was filling up three days ago, not when it hits 100% inode usage.
In this guide, we aren't discussing expensive SaaS solutions that ship your metric data across the Atlantic and put you on the wrong side of Schrems II. We are building a battle-tested, self-hosted monitoring stack using Prometheus and Grafana that stays right here on the continent.
The Observer Effect: Don't Kill Your Server While Watching It
The first rule of monitoring is: do not impact production performance. I have seen poorly configured agents burn 30% of a CPU core just to report that CPU usage is high, and the culprit is usually an I/O bottleneck rather than the collection logic itself. Time series databases (TSDBs), like the one built into Prometheus, are notoriously heavy on disk writes: every sample from every scraped target eventually becomes a write operation.
Pro Tip: Never host your monitoring stack on the same physical disk as your production database. When Prometheus starts block compaction, it will choke your MySQL I/O. We recommend isolating monitoring on a dedicated CoolVDS NVMe instance to handle the high IOPS requirements of TSDB without stealing cycles from your app.
Step 1: The Foundation (Prometheus)
We will use Docker for portability, though bare metal configuration via Ansible is valid for larger setups. Specifically, we are looking at Prometheus v2.37. We need to configure the retention period carefully—defaulting to 15 days is fine for debugging, but for trend analysis, you often want months.
Here is a production-ready docker-compose.yml snippet that limits resource usage:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.37.0
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      # bind to localhost only; see the firewall note in the checklist below
      - '127.0.0.1:9090:9090'
    deploy:
      resources:
        limits:
          memory: 2G
    restart: always

  node_exporter:
    image: prom/node-exporter:v1.3.1
    container_name: node_exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      # without this flag the /:/rootfs mount is ignored and filesystem metrics
      # describe the container instead of the host
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - '127.0.0.1:9100:9100'
    restart: always

# the named volume referenced above must be declared at the top level
volumes:
  prometheus_data: {}
Configuration Nuances
The node_exporter is lightweight, but standard configurations often miss the nuances of virtualized environments. If you are running on a VPS, you need to track CPU Steal time. This metric tells you if your "noisy neighbors" are affecting your performance.
Add this to your prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-node'
    static_configs:
      - targets: ['node_exporter:9100']
    # Optional cardinality control: keep only the metric families you actually
    # graph and alert on. Steal time arrives as node_cpu_seconds_total{mode="steal"}.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_(cpu|memory|filesystem|disk|network|load).*'
        action: keep
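To actually get woken up (at a civilised hour) when steal climbs, turn the metric into an alert. Below is a minimal rules-file sketch; the file name, the 10% threshold, and the 15-minute hold are assumptions to tune, and the file has to be referenced from prometheus.yml via a rule_files entry:

# steal-rules.yml (file name is an assumption)
groups:
  - name: cpu-steal
    rules:
      - alert: HighCpuSteal
        # average share of CPU time stolen by the hypervisor, across all cores
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'CPU steal above 10% on {{ $labels.instance }}'

If this fires regularly, the host node is oversubscribed and no amount of tuning inside your VM will fix it.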
Step 2: Visualizing the Invisible (Grafana)
Raw metrics are useless without context. Grafana (we're using v9.0 here) connects to Prometheus to render this data. But don't just import the default dashboards. They are often cluttered.
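If you manage Grafana as code, the Prometheus connection itself is one small provisioning file. A minimal sketch, assuming Grafana sits on the same Docker network as the prometheus container from the compose file above:

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # container name from the compose file above (assumes a shared network)
    isDefault: true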
For a Norwegian or European context, latency is a critical metric. You should be monitoring the connection quality from your server to major internet exchanges (like NIX in Oslo or AMS-IX in Amsterdam). A server that is up but unreachable due to packet loss is effectively down.
Use the Blackbox Exporter to probe endpoints. Here is how you visualize ICMP (Ping) latency in Grafana using PromQL:
# Average probe latency over the last 5 minutes
avg_over_time(probe_duration_seconds{job="blackbox"}[5m])
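For that query to return data, Prometheus needs a matching probe job. Here is a sketch to append under scrape_configs in the prometheus.yml above, assuming blackbox_exporter is reachable at blackbox:9115 and has an icmp module defined; the target below is a placeholder, so swap in the endpoints you actually care about:

  # appended under the existing scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [icmp]                 # assumes an "icmp" module exists in blackbox.yml
    static_configs:
      - targets:
          - '1.1.1.1'                # placeholder probe target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox:9115'   # address of the blackbox exporter container (assumption)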
If you see spikes in this graph specifically during peak Nordic business hours (08:00 - 16:00 UTC+1), your provider's uplink is congested. This is a common issue with budget hosting. CoolVDS mitigates this by peering directly at major exchanges, ensuring low latency even when traffic is heavy.
The Storage Bottleneck: Why NVMe Matters
This is the technical reality that trips up most DevOps engineers. As your metric cardinality grows (tracking per-container, per-endpoint, or per-user metrics), the random write operations to your disk skyrocket.
| Storage Type | Avg Write IOPS | TSDB Performance |
|---|---|---|
| Standard HDD (7.2k rpm) | 80-120 | Fails below even 10k active series |
| SATA SSD | 5,000-10,000 | Acceptable for small clusters |
| NVMe (CoolVDS Standard) | 200,000+ | Handles high-cardinality at scale |
When Prometheus compacts data blocks (merging recent data into long-term storage), it creates a massive I/O burst. On standard SSDs, this causes "I/O Wait," freezing your monitoring dashboard exactly when you might need it most. We standardized on NVMe for all our VPS tiers specifically to prevent this locking behavior.
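A cheap way to see that pressure coming is to record I/O wait as its own series and graph it alongside your dashboards. A minimal sketch, loaded the same way as the rules file above:

groups:
  - name: disk-pressure
    rules:
      # fraction of CPU time spent waiting on disk, averaged over 5 minutes
      - record: instance:node_cpu_iowait:ratio_rate5m
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))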
Data Sovereignty & GDPR
In the wake of Schrems II, relying on US-based cloud monitoring for sensitive infrastructure logs is a legal minefield. By hosting your own stack on a VPS in Norway, you simplify compliance. You know exactly where the data lives. It never leaves the server unless you tell it to.
The "Battle-Ready" Checklist
Before you close your terminal, verify these three things:
- Alertmanager is configured: Don't just collect data; route critical alerts to Slack or PagerDuty (a minimal config sketch follows this list).
- Firewall Rules: Ensure ports 9090 and 9100 are NOT exposed to the public internet. Use a VPN or a reverse proxy with Basic Auth.
- Resource Limits: Docker containers for monitoring must have memory limits. Prometheus will eat all available RAM for caching if you let it.
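For the first item, here is a minimal Alertmanager sketch that routes everything to a single Slack channel. The webhook URL and channel are placeholders, and the grouping intervals are assumptions to tune to your own paging tolerance:

# alertmanager.yml
route:
  receiver: 'slack-ops'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'slack-ops'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/THIS/TOKEN'   # placeholder webhook
        channel: '#ops-alerts'                                           # placeholder channel
        send_resolved: true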
Monitoring is not a "set and forget" task. It is an active part of your infrastructure defense. It requires hardware that can keep up with the write intensity and network throughput of modern stacks. Don't let slow I/O blind you to critical failures.
Need a sandbox to test your Prometheus stack? Deploy a CoolVDS NVMe instance in Oslo today. It takes 55 seconds to spin up, and the latency is rock bottom.