Silence is Fatal: Architecting Infrastructure Monitoring That Actually Works
Most monitoring setups are theater. They display pretty green dashboards while the disk I/O queue silently saturates, causing 500ms latency spikes that never trigger a single alert. I've seen it happen during Black Friday sales and critical election night traffic. The dashboard says "100% Uptime," but the checkout page is timing out.
If you are running infrastructure at scale—whether it's a Kubernetes cluster in Oslo or a fleet of monolithic VDS instances serving the Nordics—you cannot rely on default configurations. Defaults assume you don't care about the difference between "running" and "serving traffic efficiently."
In this guide, we are going to dismantle the "install and forget" mentality. We will build a monitoring stack that respects data sovereignty (hello, Datatilsynet), handles high cardinality without choking, and actually wakes you up when it matters.
The "False Positive" Trap of Cheap Infrastructure
Before we touch a single config file, understand this: your monitoring is only as reliable as the substrate it runs on. I once inherited a monitoring stack hosted on budget VPS providers in Frankfurt. The "steal time" (CPU stolen by the hypervisor) was so high that Prometheus itself couldn't scrape targets on time. We were getting false "Down" alerts simply because the monitoring server was gasping for air.
This is where the choice of provider becomes architectural, not just financial. For critical observability stacks, we use CoolVDS NVMe instances. Why? Because Time Series Databases (TSDBs) like Prometheus and VictoriaMetrics are I/O brutalists: they generate massive volumes of small writes to their write-ahead logs (WAL). On spinning rust or shared SSDs with low IOPS limits, scrape and ingestion lag creeps up. On CoolVDS, the NVMe arrays absorb high ingestion rates without adding latency to the scrape loop.
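Before you trust any box with a TSDB, measure it. A quick fio run that mimics WAL-style writes (small blocks, a flush after every write) tells you more than any spec sheet; the data directory below is an assumed path, point it wherever your TSDB will actually live:
# Simulate WAL-style writes: 4k random writes, flushed to disk after each one.
# /var/lib/prometheus is an assumed path; use your real data directory.
fio --name=wal-sim \
    --directory=/var/lib/prometheus \
    --ioengine=libaio \
    --rw=randwrite --bs=4k --size=1G \
    --fdatasync=1 \
    --runtime=60 --time_based
If the fsync/fdatasync latency percentiles in the output sit in the tens of milliseconds, the box will struggle the moment ingestion spikes.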
Step 1: The Collector – Beyond Default Scrapes
Standard node_exporter is noisy. You don't need to know the temperature of every virtual sensor in a cloud environment. You do, however, need to know about pressure stalls.
By late 2025, eBPF tooling has become widely accessible, but good old /proc data via PSI (Pressure Stall Information) remains the most reliable answer to "is my server hurting?", and it tells you far more than raw CPU usage percentages.
Here is how you should run node_exporter to avoid metric bloat while capturing what matters. Note the exclusion of useless collectors:
/usr/local/bin/node_exporter \
--collector.disable-defaults \
--collector.cpu \
--collector.cpufreq \
--collector.diskstats \
--collector.filesystem \
--collector.loadavg \
--collector.meminfo \
--collector.netdev \
--collector.pressure \
--collector.sockstat \
--collector.stat \
--collector.systemd \
--web.listen-address=":9100"
Don't just run it. Verify it exposes pressure stalls:
curl -s localhost:9100/metrics | grep pressure
If you see node_pressure_cpu_waiting_seconds_total, you are ready. This metric tells you when processes are stalling because they are waiting for CPU time—a far better indicator of saturation than raw usage percentages.
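To turn that counter into something you can alert on, take its rate. The counter tracks seconds spent waiting per second of wall time, so the rate is roughly the fraction of time tasks were stalled. A sketch of the kind of expression we build alerts around; the 0.2 threshold is an assumption to tune, not a universal constant:
# Fraction of the last 5 minutes that tasks spent stalled waiting for CPU.
# 0.2 = stalled 20% of the time. Assumed starting threshold - tune per workload.
rate(node_pressure_cpu_waiting_seconds_total[5m]) > 0.2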
Step 2: The Storage – Handling Long-Term Retention in Norway
Prometheus is fantastic for short-term operational data. It is terrible for long-term retention. If you want to compare this year's Christmas traffic to last year's, keeping that data in a local Prometheus instance is a ticking time bomb for your disk space.
Furthermore, if you are operating in Norway, you have GDPR and Schrems II considerations. Pumping metric data (which often inadvertently contains IP addresses or user IDs in labels) to a US-managed SaaS cloud is a compliance headache. The pragmatic CTO keeps this data on Norwegian soil.
We recommend a tiered approach: a local Prometheus on each edge node for real-time alerting (15 days of retention), remote-writing to a centralized VictoriaMetrics instance on a storage-optimized CoolVDS server.
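On the edge nodes, set that 15-day cap explicitly instead of trusting defaults. A minimal sketch of the relevant Prometheus flags, assuming the binary and paths shown here (adjust to your layout):
# Local TSDB keeps 15 days for fast on-box alerting; long-term history lives in VictoriaMetrics.
/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --web.listen-address=":9090"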
Configuration: Prometheus Remote Write
Add this to your prometheus.yml on your edge nodes:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

remote_write:
  - url: "http://monitoring.your-coolvds-domain.no:8428/api/v1/write"
    queue_config:
      max_shards: 1000
      capacity: 2500
      max_samples_per_send: 500
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop  # Drop useless Go runtime metrics to save bandwidth
This setup pushes data to your central repository in Oslo. The write_relabel_configs section is crucial. Dropping go_* metrics (internal runtime stats of the exporter itself) can reduce your ingestion volume by 20% instantly.
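Remote write fails quietly when the queue falls behind, so monitor the shipper itself. Two expressions worth graphing on every edge Prometheus; the metric names below are the standard remote-write internals in recent Prometheus releases, so verify them against your version:
# Samples dropped after exhausting retries - should stay flat at zero.
rate(prometheus_remote_storage_samples_failed_total[5m])

# Samples sitting in the send queue - a steady climb means you are falling behind.
prometheus_remote_storage_samples_pending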
Step 3: Alerting That Doesn't Suck
Alert fatigue kills on-call effectiveness. If your phone buzzes every night at 3 AM for a "High CPU" warning that resolves itself in 2 minutes, you will eventually ignore a real outage.
We use Alertmanager with inhibition rules. If a whole data center goes down, you don't need 50 alerts for every VM in that DC. You need one alert: "DC Offline."
Here is a robust alertmanager.yml configuration snippet:
route:
  group_by: ['alertname', 'datacenter']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-ops'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      repeat_interval: 1h

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['instance']
The inhibit_rules block is the magic. If an instance throws a CRITICAL alert (e.g., "Host Down"), Alertmanager suppresses the WARNING alerts (e.g., "High Latency") for that same instance. This keeps your pager clean.
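For that inhibition to work, both alerts must actually carry the instance label. A sketch of what the matching rule pair could look like on the Prometheus side; the names, thresholds, and "for" durations are illustrative, not our production values:
groups:
  - name: example-inhibition-pair
    rules:
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
      - alert: HighIOPressure
        expr: rate(node_pressure_io_waiting_seconds_total[5m]) > 0.3   # illustrative threshold
        for: 10m
        labels:
          severity: warning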
Pro Tip: Use `predict_linear` for disk space alerts. Instead of alerting when disk is 90% full (which might be fine for a static archive), alert when the disk will be full in 4 hours based on current write rates.
The query looks like this:
predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
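Wrapped in a rule file, with a "for" clause so one burst of writes doesn't page anyone, it might look like this; the rule name and the 30-minute hold are assumptions, adjust to taste:
groups:
  - name: disk-capacity
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} ({{ $labels.mountpoint }}) is on track to fill within 4 hours."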
Network Latency: The Norwegian Context
Norway is a long country. The latency from Stavanger to a server in Oslo is not the same as the latency from Tromsø. If your customers are local, your monitoring needs to reflect that.
We use the Blackbox Exporter to monitor latency specifically from the CoolVDS data center to key Norwegian infrastructure points (NIX, major ISPs like Telenor). This helps distinguish between "My app is slow" and "The internet in Northern Norway is congested."
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []    # Defaults to 2xx
      method: GET
      follow_redirects: true
      fail_if_not_ssl: true     # Require HTTPS; a plain-HTTP response counts as a failure
  icmp_norway:
    prober: icmp
    timeout: 2s
    icmp:
      preferred_ip_protocol: "ip4"
Run this check every 10 seconds. Since CoolVDS has direct peering at NIX, any deviation in latency usually points to an external ISP issue rather than your infrastructure, giving you ammunition when customers complain.
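The module file does nothing until Prometheus drives it. A sketch of the matching scrape job; the targets and the exporter address are placeholders, while the relabeling pattern is the standard blackbox setup:
scrape_configs:
  - job_name: 'blackbox_icmp_norway'
    metrics_path: /probe
    scrape_interval: 10s
    params:
      module: [icmp_norway]
    static_configs:
      - targets:
          - isp-endpoint.example.no   # placeholder - replace with the NIX / ISP targets you care about
          - your-app.example.no       # placeholder
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115   # assumed address of your blackbox_exporter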
Infrastructure as Code (Ansible)
Never configure this manually. Here is a snippet for an Ansible task to ensure your Prometheus user exists and is locked down, a security requirement often overlooked:
- name: Create Prometheus system user
  user:
    name: prometheus
    system: yes
    shell: /usr/sbin/nologin
    home: /var/lib/prometheus
    create_home: no

- name: Create Prometheus directories
  file:
    path: "{{ item }}"
    state: directory
    owner: prometheus
    group: prometheus
    mode: '0750'
  loop:
    - /etc/prometheus
    - /var/lib/prometheus
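From there, let the same playbook own the service lifecycle. A minimal sketch of the follow-on task, assuming you template a prometheus.service unit file elsewhere in the role:
- name: Ensure Prometheus is enabled and running
  systemd:
    name: prometheus
    enabled: yes
    state: started
    daemon_reload: yes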
Conclusion
Observability is not about collecting every metric; it is about collecting the right metrics and storing them on infrastructure that won't buckle under the write load. By using specialized flags in node_exporter, implementing smart inhibition in Alertmanager, and leveraging the high-speed NVMe storage of CoolVDS, you build a system that tells you the truth.
Don't let your monitoring be the single point of failure. Deploy a dedicated monitoring instance on CoolVDS today and see what your infrastructure is actually doing.