Stop Guessing: Building a Sovereign APM Stack on Ubuntu 20.04 LTS
Most developers treat monitoring as an afterthought. They throw an agent on the server, pipe data to a US-based SaaS dashboard, and call it a day. Then the invoice arrives. Or worse, Datatilsynet (the Norwegian Data Protection Authority) comes knocking, asking where your users' IP addresses are being processed.
After the Schrems II ruling in July, relying on external cloud providers for deep application introspection isn't just a latency risk—it's a compliance minefield. If you are serving traffic in Oslo, your metrics should live in Oslo.
I've debugged production outages where the dashboard showed green, but the disk I/O was completely saturated. Why? Because the granularity was set to 5-minute averages. In 5 minutes, a server can die and restart three times. In this guide, we are building a high-resolution, self-hosted monitoring stack that respects data sovereignty and leverages the raw NVMe power available on CoolVDS instances.
The War Story: The "Ghost" 502 Errors
Last Black Friday, a client running a high-traffic Magento cluster started throwing 502 Bad Gateway errors. The load balancers looked fine. CPU usage was at 40%. RAM had plenty of headroom. Yet, customers were seeing white screens.
We were flying blind because our SaaS APM was sampling requests to save money. We missed the micro-bursts.
We SSH'd in and ran iostat -x 1. The svctm (service time) column for the database disk was spiking to 200ms every few seconds. The culprit? Aggressive log rotation writing to the same disk as the InnoDB data files. If we had proper I/O monitoring with second-level granularity, we would have seen this weeks ago.
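If you ever need to reproduce that kind of diagnosis by hand, second-level disk statistics are one package away. A minimal sketch, assuming Ubuntu 20.04 with the sysstat package (the device name vda is only a placeholder):

# iostat ships with sysstat on Ubuntu 20.04
sudo apt install -y sysstat
# Extended per-device stats, refreshed every second.
# Watch r_await / w_await (latency in ms) and %util.
iostat -x 1
# Or focus on a single device (vda is just an example)
iostat -x vda 1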
The Stack: Prometheus + Grafana on Local Infrastructure
We are going to use Prometheus for scraping and Grafana for visualization. Why self-hosted? Because Time Series Databases (TSDBs) are heavy on disk writes. On a shared standard HDD VPS, Prometheus will choke. This is where CoolVDS NVMe instances become the reference implementation. You need high IOPS to ingest thousands of metrics per second without lag.
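If you want to sanity-check the disk before committing to a retention window, a quick random-write test gives a rough idea of sustained write IOPS. This is only a sketch: fio is not installed by default, and the target directory below is an arbitrary scratch path, not anything Prometheus requires.

sudo apt install -y fio
# 4k random writes for 30 seconds with direct I/O, roughly mimicking TSDB write pressure
fio --name=tsdb-sim --directory=/var/tmp --rw=randwrite --bs=4k --size=1G \
    --runtime=30 --time_based --direct=1 --ioengine=libaio --iodepth=16
# Clean up the scratch file afterwards
rm -f /var/tmp/tsdb-sim*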
1. System Tuning for High Throughput
Before installing anything, prepare your Linux kernel. By default, Ubuntu 20.04 isn't tuned for the massive number of open files a heavy monitoring stack requires.
Edit /etc/sysctl.conf:
# Increase the number of open files
fs.file-max = 2097152
# Adjust TCP keepalive to detect dead scrapers faster
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
# Allow more connections to complete
net.core.somaxconn = 65535
Apply it with sysctl -p. Don't skip this.
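A quick verification that the kernel actually picked up the new values:

sysctl fs.file-max net.core.somaxconn net.ipv4.tcp_keepalive_time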
2. Deploying the Stack with Docker Compose
We will use Docker Compose. It’s portable and clean. Ensure you have Docker 19.03+ installed.
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.22.0
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle' # Allows hot reload
    ports:
      - "9090:9090"
    restart: always

  grafana:
    image: grafana/grafana:7.3.1
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=SecretPassword123!
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
    restart: always

  node-exporter:
    image: prom/node-exporter:v1.0.1
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: always

volumes:
  prometheus_data:
  grafana_data:
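Assuming the file above is saved as docker-compose.yml, bring the stack up and confirm Prometheus answers on its health endpoint:

docker-compose up -d
docker-compose ps
# Should return "Prometheus is Healthy."
curl -s http://localhost:9090/-/healthy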
Notice the --web.enable-lifecycle flag on Prometheus? This lets you reload configuration without restarting the container using a simple curl command:
curl -X POST http://localhost:9090/-/reload
3. Configuring Prometheus Scrapers
Create a `prometheus/prometheus.yml` file. Here is where latency to Oslo matters. If your monitoring server is in Frankfurt and your app is in Norway, network jitter will pollute your data. Keep them close.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'nginx_prod'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['10.0.0.5:9113'] # Private IP of your App Server
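Before reloading, you can lint the file with promtool, which ships inside the Prometheus image itself; the bind-mount path below simply mirrors the compose file:

docker run --rm -v "$(pwd)/prometheus":/etc/prometheus \
  --entrypoint promtool prom/prometheus:v2.22.0 \
  check config /etc/prometheus/prometheus.yml

If the check passes, the curl reload shown earlier picks up the change without restarting the container.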
Exposing Nginx Metrics
To get real data, you need your web server to talk. On your application server (the one CoolVDS hosts for you), enable the stub_status module in Nginx.
Inside your /etc/nginx/sites-enabled/default:
server {
    listen 127.0.0.1:80;
    server_name 127.0.0.1;

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
Then, use nginx-prometheus-exporter to bridge this to Prometheus.
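A minimal way to wire that up is the official exporter container; host networking lets it reach the loopback-only status page. The image tag below is an assumption, so pin whatever version you have vetted.

# Validate and reload Nginx, then start the exporter against the local status page
sudo nginx -t && sudo systemctl reload nginx
docker run -d --restart always --network host \
  nginx/nginx-prometheus-exporter:0.8.0 \
  -nginx.scrape-uri=http://127.0.0.1/nginx_status

The exporter listens on port 9113 by default, which is exactly what the nginx_prod job in prometheus.yml above expects.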
SaaS vs. Self-Hosted: The TCO Reality
Many CTOs argue that SaaS is cheaper because you don't manage the server. They forget about data transfer costs and