Surviving Scale: Building a Latency-Obsessed Monitoring Stack in 2018
If your monitoring system sends you an alert five minutes after your database has already crashed, you don't have a monitoring system. You have a notification service for your resume update.
I learned this the hard way during a Black Friday migration last year. We were relying on a legacy Zabbix setup with standard polling intervals. The load balancer fell over, but our dashboard showed green for another 180 seconds. By the time the SMS arrived, we had lost significant revenue and customer trust.
In 2018, infrastructure is too volatile for static checks. Containers spin up and die in seconds. We need time-series data, we need it in real-time, and we need to store it without violating the GDPR regulations that just hit us in May.
Here is how to build a battle-ready monitoring stack using Prometheus and Grafana, specifically tailored for the high-throughput demands of modern deployments in the Nordic region.
The I/O Bottleneck: Why Your TSDB is Slow
Time Series Databases (TSDBs) like Prometheus are write-heavy. They ingest thousands of data points per second. If you deploy this on a budget VPS with standard SSDs (or worse, spinning rust), your iowait will skyrocket. I have seen Prometheus instances stall simply because the underlying storage couldn't keep up with the ingestion rate of a modest Kubernetes cluster.
Pro Tip: Never colocate your monitoring stack on the same physical disk as your production database. If the database spirals and eats I/O, you lose visibility exactly when you need it most. This is why we isolate monitoring nodes on CoolVDS NVMe instances—the dedicated I/O throughput ensures metrics keep writing even when the world is burning.
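You can put a number on that risk once the stack below is running. Here is a minimal sketch of a Prometheus alerting rule for sustained iowait, assuming the node_exporter v0.16 metric names used later in this post; the rule name and the 5% threshold are mine, and actually routing the notification requires Alertmanager, which is out of scope today:

groups:
  - name: disk-io
    rules:
      - alert: HighIOWait
        # node_exporter v0.16+ exposes per-CPU time as node_cpu_seconds_total;
        # averaged across cores, >0.05 means more than 5% of CPU time spent waiting on disk
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is spending over 5% of its time in iowait"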
The Architecture: Pull vs. Push
Unlike the old push-based models, Prometheus pulls metrics. Your services expose an HTTP endpoint (usually /metrics), and Prometheus scrapes them.
Why this matters for security: You don't need to open firewall ports on your monitoring server to the entire world. You only need to allow the monitoring server to reach your targets. If you are hosting within Norway, utilizing a provider with a robust local LAN or private networking option reduces latency and keeps traffic off the public internet.
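To make that concrete, here is a hypothetical scrape job aimed at app servers over a private LAN; the 10.0.0.x addresses are placeholders for whatever internal IPs your provider assigns:

# Prometheus reaches out over the internal network; nothing is exposed publicly
scrape_configs:
  - job_name: 'app_servers'
    static_configs:
      - targets: ['10.0.0.11:9100', '10.0.0.12:9100']  # private addresses only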
Step 1: The Foundation (Docker Compose)
Let's set up a scraping stack. I'm assuming you are running Docker 18.06+ on a CentOS 7 or Ubuntu 18.04 host.
Create a docker-compose.yml file:
version: '3'

services:
  prometheus:
    image: prom/prometheus:v2.4.3
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention=15d'   # roughly two weeks of raw metrics
    ports:
      - 9090:9090
    restart: unless-stopped

  grafana:
    image: grafana/grafana:5.3.0
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=SecurePassword123!   # change this before exposing port 3000
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v0.16.0
    volumes:
      # read-only host mounts so the exporter reports on the host, not the container
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points'
      - '^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)'
    ports:
      - 9100:9100
    restart: always

volumes:
  prometheus_data: {}
  grafana_data: {}
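Bring it up with docker-compose up -d and open http://<your-host>:9090/targets to confirm Prometheus can see its scrape targets. You can also skip the manual datasource click-through in Grafana: version 5 reads provisioning files from /etc/grafana/provisioning/datasources, so a minimal sketch like the one below, assuming you add a bind mount such as ./grafana/datasources:/etc/grafana/provisioning/datasources to the grafana service, wires Prometheus in automatically:

# ./grafana/datasources/prometheus.yml (path is an assumption; mount it into the container)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # the compose service name resolves on the default network
    isDefault: true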
Step 2: Configuration That Actually Scales
The default configuration is fine for a laptop, not for production. Create prometheus.yml and pay attention to scrape_interval: 15 seconds is the sweet spot for most web apps. One minute is too slow to catch short-lived problems, and one second multiplies your storage and I/O load for very little extra insight.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']
        labels:
          env: 'production'
          region: 'no-oslo-1'
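Those commented-out rule_files entries are where alerting lives. Here is a minimal sketch of a rules file, combining the HighIOWait rule from earlier with an instance-down check; mount it next to prometheus.yml, point a rule_files entry at it, and remember that delivering the actual notification requires Alertmanager, which deserves its own post. Names and thresholds are illustrative:

# alert_rules.yml -- illustrative
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0      # the scrape itself failed
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for one minute"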
The GDPR Elephant in the Room
Since May 25th, everyone is panicking about data residency. While metric data (CPU usage, RAM) is generally not PII (Personally Identifiable Information), log data often is. If you expand this stack to include ELK (Elasticsearch, Logstash, Kibana) for logs, you are treading dangerous waters if that data leaves the EEA.
By hosting your monitoring infrastructure in Norway (outside the direct reach of US CLOUD Act implications when using purely local providers), you simplify your compliance posture. Datatilsynet (the Norwegian Data Protection Authority) is strict. Don't give them a reason to audit you because you accidentally stored IP addresses in a bucket in Virginia.
Optimizing for Low Latency
If your users are in Oslo or Stockholm, your monitoring should be too. Network latency affects the timestamp accuracy of your scrapes. When we migrated a client from a Frankfurt AWS instance to a CoolVDS instance in Oslo, the jitter in their network graphs dropped by 12ms. That clarity allowed us to diagnose a micro-burst issue in their Nginx configuration.
Nginx Tuning for Metrics
To expose Nginx metrics to Prometheus, you need the ngx_http_stub_status_module. Once enabled, use the nginx-prometheus-exporter sidecar. But first, ensure your Nginx isn't the bottleneck.
Check your nginx.conf for these 2018 standards:
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 2048;
    use epoll;
    multi_accept on;
}
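With stub_status enabled in Nginx (a small location block that turns the module on, reachable only from the exporter), the exporter itself is just one more sidecar under services: in the compose file above. A sketch, assuming the official nginx/nginx-prometheus-exporter image; the image tag and the scrape URL are placeholders you will need to adjust:

  # goes under services: in docker-compose.yml
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:0.1.0   # pin whatever release is current
    command:
      - '-nginx.scrape-uri=http://your-nginx-host:8080/stub_status'  # placeholder address
    ports:
      - 9113:9113
    restart: unless-stopped

Then add a scrape job pointing at nginx-exporter:9113 to prometheus.yml, the same way node_exporter was wired up in Step 2.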
When to Upgrade?
This single-node stack works brilliantly up to about 500 targets. Beyond that, Prometheus memory usage will spike. At that point, you're looking at federation or remote storage adapters—topics for another day.
For now, the priority is visibility. The difference between a 99.9% SLA and a broken contract is often just disk speed. We use KVM virtualization on CoolVDS because containers can suffer from noisy neighbors in shared kernel environments. When you are writing 50k metrics a second, you want the kernel to yourself.
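Before planning a federation layer, measure what you actually ingest. Prometheus instruments itself, so a recording rule like this sketch (the rule name is mine; the prometheus_tsdb_head_samples_appended_total metric is built in) tracks samples written per second over time:

groups:
  - name: meta
    rules:
      # how many samples per second this Prometheus is appending to its TSDB
      - record: instance:prometheus_ingest_samples:rate5m
        expr: rate(prometheus_tsdb_head_samples_appended_total[5m])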
Next Steps:
- Audit your current polling interval. If it's >30s, change it.
- Check your disk I/O wait times. If iowait > 5%, you need faster storage.
- Deploy the stack above on a fresh instance.
Don't let slow I/O kill your observability. Deploy a test NVMe instance on CoolVDS in 55 seconds and see the difference raw speed makes.