You can't fix what you can't see.
It's 3:00 AM on a Tuesday. Your phone lights up. The load balancer is throwing 502 Bad Gateway errors, but your current monitoring dashboard says CPU usage is at a comfortable 40%. You check the logs. Nothing obvious. You restart the service. It works for ten minutes, then crashes again.
Welcome to the reality of "shallow monitoring."
In the wake of the GDPR enforcement that just hit us on May 25th, the stakes for data integrity and availability in Europe have never been higher. If you are hosting critical infrastructure here in Norway, you aren't just battling downtime anymore; you are battling compliance audits from Datatilsynet. As a System Architect who has spent the last decade debugging race conditions across distributed clusters, I can tell you that most default monitoring setups are garbage. They measure noise, not signal.
Today, we are going to build a monitoring stack that actually works. We will use the new Prometheus 2.0 storage engine, visualize it with Grafana 5, and talk about why hosting provider choice (specifically regarding "Steal Time") is the hidden killer of performance.
The "Push" vs. "Pull" Debate is Over
For years, we argued about Nagios vs. Zabbix vs. InfluxDB. In 2018, for dynamic infrastructure, the pull model wins. We use Prometheus. Why? Because I don't want my production servers burning CPU cycles trying to push metrics to a central server that might be down. I want a central scraper that collects metrics when it is ready.
| Feature | Push Model (e.g., Graphite/StatsD) | Pull Model (Prometheus) |
|---|---|---|
| Agent Load | Higher (Agent handles retry logic) | Lower (Agent just exposes HTTP endpoint) |
| Firewalls | Easier (Outbound only) | Harder (Requires inbound allow-list) |
| Service Discovery | Manual config usually | Native (DNS, Consul, K8s) |
If you are running on CoolVDS NVMe instances, you have the I/O throughput to handle high-resolution scraping (1s intervals) without even blinking. Try that on a standard spinning-disk VPS, and you'll create your own denial of service.
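If you want to try 1-second resolution, the knob is a per-job override of the global interval; a minimal sketch (the job name and target address below are placeholders):

scrape_configs:
  - job_name: 'high-res-node'      # placeholder name for the 1s job
    scrape_interval: 1s            # overrides the global default for this job only
    static_configs:
      - targets: ['10.0.0.5:9100'] # example node_exporter endpoint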
Step 1: The Exporter Strategy
Don't just install `node_exporter` and walk away. Configure it to ignore the noise. We don't need to monitor every single loopback interface or tmpfs mount. Here is a production-ready systemd unit file for `node_exporter` 0.16.0 that focuses on what matters.
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.disable-defaults \
--collector.cpu \
--collector.meminfo \
--collector.loadavg \
--collector.filesystem \
--collector.netdev \
--collector.diskstats \
--web.listen-address=:9100
[Install]
WantedBy=multi-user.target

This configuration strips out the bloat. We only care about CPU, memory, load, filesystem, and disk I/O.
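Assuming you save the unit as /etc/systemd/system/node_exporter.service, the prometheus user exists, and the binary sits at /usr/local/bin, wiring it up is the standard systemd routine:

# reload unit definitions, then start the exporter and enable it at boot
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# sanity check: the metrics endpoint should answer locally
curl -s http://127.0.0.1:9100/metrics | head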
Pro Tip: If you are hosting on a shared platform, watch the `node_cpu_seconds_total{mode="steal"}` metric. Steal time accumulates when your VM is ready to run but the hypervisor is busy serving other tenants' workloads. At CoolVDS, we use KVM virtualization with strict resource guarantees, so your steal time should sit at 0.0%. If you see it spiking above 2% on your current host, move. Now.
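That 2% threshold translates directly into PromQL; a rough sketch that averages steal across all cores per host over five minutes:

# percent of CPU time stolen by the hypervisor, per instance
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) > 2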
Step 2: Configuring the Scraper
Prometheus 2.0 brought massive performance improvements to the time-series database (TSDB). We can now ingest millions of samples per second. However, retention is key. If you are complying with strict Norwegian data retention policies, you might need to limit how long you keep logs versus metrics.
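Retention in Prometheus 2.0 is controlled by startup flags rather than the config file; a sketch with example paths (adjust to your own layout):

/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=15d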
Here is a `prometheus.yml` configured for a typical Nordic setup: one job scraping node_exporter on your web and database hosts, plus a separate job for the Nginx exporter.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
        labels:
          region: 'no-oslo-1'
          env: 'production'

  - job_name: 'nginx'
    static_configs:
      - targets: ['10.0.0.5:9113']  # Nginx Exporter
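Before reloading, let promtool (shipped in the same tarball) validate the file; the path below assumes the standard /etc/prometheus layout:

promtool check config /etc/prometheus/prometheus.yml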
Step 3: Alerting on Symptoms, Not Causes

Stop alerting on "High CPU." High CPU is fine if you are rendering video or compiling code. Alert on "High Latency" or "Error Rate." That is what impacts the user. We use `Alertmanager` for this.
Here is a rule file `alerts.yml` that detects if your instance is actually unreachable or if latency is killing your SEO ranking.
groups:
  - name: host-level
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: HighLoad
        # on(instance) is required: node_load1 carries extra labels (job, region, env)
        # that the aggregated core count on the right-hand side does not have
        expr: node_load1 > on(instance) 2 * count by (instance) (node_cpu_seconds_total{mode="idle"})
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host under high load"
          description: "Load average is above 2x the number of cores. Check for hung processes."
The Nginx/PHP-FPM Black Box
Linux metrics are great, but they don't tell you if PHP is hanging. In 2018, if you aren't monitoring your `php-fpm` active processes, you are flying blind. You need to enable the status page in your pool config:
; /etc/php/7.2/fpm/pool.d/www.conf
pm.status_path = /status

And map it in Nginx so only your local monitoring agent can hit it:
server {
    listen 127.0.0.1:80;
    server_name localhost;

    location /status {
        fastcgi_pass unix:/run/php/php7.2-fpm.sock;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        allow 127.0.0.1;
        deny all;
    }
}
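Once the pool and the vhost are reloaded, a local check confirms the plumbing before you point an exporter at it; the fields worth graphing are "active processes", "listen queue", and "max children reached":

# query the FPM status page from the box itself
curl -s http://127.0.0.1/status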
Why Infrastructure Choice Dictates Monitoring Success

You can have the best Grafana dashboards in the world, but if the underlying hardware has high I/O latency, your database will lock up. I've seen it happen with cheap VPS providers who oversell their SSD arrays.
This is why for data-intensive applications, we rely on CoolVDS. Their infrastructure in Oslo connects directly to NIX (Norwegian Internet Exchange), ensuring that when we monitor latency, we are measuring our code, not the distance to a data center in Frankfurt or Amsterdam. Plus, with the new GDPR rules, keeping data within Norwegian borders (or EEA) simplifies the legal headache significantly.
Latency Matters
Let's look at a quick `ping` test from a CoolVDS instance in Oslo to major endpoints. Low latency isn't just about speed; it's about the reliability of TCP handshakes during high traffic.
# ping -c 4 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_seq=1 ttl=59 time=1.84 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=59 time=1.89 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=59 time=1.81 ms
64 bytes from 1.1.1.1: icmp_seq=4 ttl=59 time=1.85 ms
--- 1.1.1.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 1.812/1.849/1.892/0.048 ms

Sub-2ms response times. That is the baseline you should expect. If your current provider is giving you 15ms to local gateways, you are already losing.
Conclusion
Monitoring is not about pretty charts. It is about knowing something is wrong before your customer calls you. It's about having the historical data to prove that the database crash was caused by an I/O bottleneck, not bad code.
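With the diskstats collector enabled above, that proof is a single query away; a sketch that approximates how busy each device was (values near 1 mean the disk was saturated):

# fraction of each second the disk spent doing I/O, per device
rate(node_disk_io_time_seconds_total[5m])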
Don't build your house on sand. Start with a solid foundation. Deploy a CoolVDS instance today, set up this Prometheus stack, and finally get a good night's sleep. Your future self (and your uptime reports) will thank you.