Surviving the Spike: Architecting Prometheus for High-Scale Infrastructure
Silence isn't golden. In the world of systems administration, silence is suspicious. If your pager isn't going off, it usually means one of two things: either your infrastructure is running perfectly (unlikely), or your monitoring system has quietly died while your servers are burning down.
I learned this the hard way during the 2019 holiday rush. We were managing a high-traffic e-commerce cluster for a Nordic retailer. Our dashboard showed green across the board. CPU usage was nominal. Memory had headroom. Yet, checkout requests were timing out. Why? We were monitoring averages, not percentiles, and our storage subsystem was thrashing due to noisy neighbors on a budget public cloud provider. We were blind to the I/O wait times until the database completely locked up.
That incident changed how I approach observability. Today, we aren't just asking "is it up?" We want granular, high-resolution metrics that tell us how the system is actually behaving. In this guide, I will walk you through setting up a robust, scalable monitoring stack using Prometheus and Grafana, specifically tailored for the regulatory and technical realities of late 2021.
The Stack: Why Prometheus?
By now, Prometheus is the de facto standard for cloud-native monitoring. Unlike legacy setups built around agents pushing data or heavyweight server-side check scheduling (looking at you, Nagios), Prometheus uses a pull model: it scrapes metrics over HTTP from your endpoints. This is crucial for security architectures in 2021 because you don't need to open inbound ports on your monitoring server to the world; you only need your monitoring server to reach your nodes.
However, Prometheus is a beast when it comes to disk I/O. It writes thousands of data points per second. If you attempt to run a production-grade Prometheus instance on standard SATA SSDs—or worse, spinning rust—you will hit a write bottleneck long before you hit a CPU limit. This is why for our internal stacks, and for any client asking for recommendations, we deploy on CoolVDS NVMe instances. The random write performance of NVMe is not a luxury here; it is a requirement for a Time Series Database (TSDB) that needs to ingest metrics from 50+ nodes simultaneously.
Step 1: The Exporter Strategy
You cannot improve what you cannot measure. On Linux, node_exporter is your eyes and ears. Do not just apt-get install it and forget it: the default settings enable collectors you will never query, which wastes CPU and bloats your TSDB with series nobody looks at.
Here is a production-hardened systemd service file for node_exporter. Note the collector flags: we disable the defaults wholesale and re-enable only what we need, skipping things like WiFi or InfiniBand to save CPU cycles and reduce the noise in our TSDB.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.meminfo \
  --collector.filesystem \
  --collector.netdev \
  --collector.loadavg \
  --collector.diskstats \
  --collector.filesystem.ignored-mount-points="^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($|/)"

[Install]
WantedBy=multi-user.target
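On the Prometheus side, scraping this exporter is a single job pointing at port 9100 (node_exporter's default). Here is a minimal sketch; the IPs, hostnames and the dc label are placeholders for your own private network, and the job name 'node' is what the federation config later in this guide matches on:
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets:
          - '10.20.30.11:9100'  # app-01, placeholder private IP
          - '10.20.30.12:9100'  # app-02, placeholder private IP
        labels:
          dc: 'oslo-dc1'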
Step 2: Configuring Prometheus for Federation
As you scale beyond a single data center—perhaps you have a cluster in Oslo for low latency to Norwegian customers and another in Frankfurt for the broader EU—a single Prometheus server becomes a single point of failure. In 2021, the best practice is Federation.
You run a "slave" Prometheus in each data center (DC) that scrapes local nodes. Then, you have a "master" Prometheus (ideally hosted on a high-availability CoolVDS instance) that scrapes only the aggregated data from the slaves. This keeps your cross-DC bandwidth usage low and alerts fast.
Here is how you configure the master prometheus.yml to scrape a slave:
scrape_configs:
  - job_name: 'federate-oslo-dc1'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - '10.20.30.40:9090'  # Private IP of your Oslo CoolVDS node
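Note the second matcher, '{__name__=~"job:.*"}': it assumes the local Prometheus servers are pre-aggregating metrics with recording rules whose names start with the job: prefix. Here is a sketch of such a rules file for the Oslo instance; the group name, rule names and the specific aggregations are illustrations of that convention, not something Prometheus mandates:
groups:
  - name: federation-aggregates
    rules:
      - record: job:node_cpu_seconds:rate5m
        expr: sum by (job, mode) (rate(node_cpu_seconds_total[5m]))
      - record: job:node_memory_MemAvailable_bytes:sum
        expr: sum by (job) (node_memory_MemAvailable_bytes)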
Pro Tip: Always use private networking for scraping metrics between servers in the same facility. It reduces latency and keeps your metrics off the public internet. CoolVDS offers unmetered private networks, which are a lifesaver for heavy monitoring traffic.
Step 3: Storage Retention and Performance
By default, Prometheus keeps data for 15 days. To satisfy certain audit requirements, or simply for longer capacity-planning history, you might need 90 days or more. However, increasing retention increases the size of the TSDB blocks on disk.
If you are running this on a VPS with shared spinning disks, increasing retention will kill your performance during the "compaction" phase (where Prometheus merges data blocks). This is where the hardware underlying your VPS matters.
| Storage Type | Compaction Speed (Approx) | Risk of I/O Wait |
|---|---|---|
| HDD / Hybrid | Slow | High (metric loss possible) |
| Standard SSD | Medium | Moderate |
| CoolVDS NVMe | Fast | Near zero |
To adjust retention, modify your startup flags:
/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention.time=90d \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
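Before you raise retention to 90 days, make sure you will notice when the volume starts running out of headroom. Here is a sketch of an alerting rule for that, assuming node_exporter runs on the Prometheus host and the TSDB sits on its own volume mounted at /var/lib/prometheus; adjust the mountpoint and threshold to your layout:
groups:
  - name: prometheus-capacity
    rules:
      - alert: PrometheusDiskFillingUp
        expr: node_filesystem_avail_bytes{mountpoint="/var/lib/prometheus"} / node_filesystem_size_bytes{mountpoint="/var/lib/prometheus"} < 0.20
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% of the TSDB volume left on {{ $labels.instance }}"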
The Compliance Elephant: Schrems II and Data Sovereignty
Since the Schrems II ruling last year (2020), moving data between the EU/EEA and the US has become a legal minefield. You might think, "It's just server metrics, who cares?"
But consider this: your web server access logs (often ingested by Promtail or Logstash) contain client IP addresses, and under the GDPR an IP address is personal data. If you ship these logs to a US-based SaaS monitoring platform, you are potentially violating the GDPR unless you have watertight Standard Contractual Clauses (SCCs) and supplementary measures in place.
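If you do ship access logs, one pragmatic mitigation is to anonymize client IPs before they ever leave the box. Here is a rough sketch of a Promtail scrape config with a replace pipeline stage that masks IPv4 addresses; the log path and regex are illustrative, it does nothing for IPv6, and you should test it against your own log format:
scrape_configs:
  - job_name: nginx
    static_configs:
      - targets:
          - localhost
        labels:
          job: nginx
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - replace:
          # Capture anything that looks like an IPv4 address and mask it
          expression: '(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
          replace: 'REDACTED'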
This is why the "Pragmatic CTO" chooses to host monitoring infrastructure locally. Hosting your Prometheus stack on a VPS in Norway (like CoolVDS) keeps the data within the EEA, satisfying Datatilsynet requirements and ensuring you aren't waking up to a fine.
Visualizing with Grafana
Finally, we need to visualize this data. Install Grafana 8.2 (the current stable release at the time of writing). When adding Prometheus as a data source, set the data source's scrape interval to match your Prometheus scrape_interval; otherwise Grafana can pick query steps that produce aliasing artifacts in your graphs.
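If you provision the data source as code instead of clicking through the UI, that interval lives under jsonData.timeInterval. A minimal provisioning file sketch, assuming Grafana runs next to Prometheus and can reach it on localhost:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    jsonData:
      timeInterval: '15s'  # match the scrape_interval in prometheus.yml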
Here is a snippet of a PromQL query I use to detect "CPU Steal"—a silent killer in virtualized environments. CPU Steal happens when the hypervisor is busy serving other tenants and delays your VM. If this goes above 5%, you need to move hosts.
rate(node_cpu_seconds_total{mode="steal"}[5m]) * 100
At CoolVDS, we use KVM (Kernel-based Virtual Machine) with strict resource scheduling. We rarely see steal time go above 0.1%, but on budget hosts, I've seen this hit 20%, causing massive application latency that no code optimization could fix.
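To make that 5% threshold actionable rather than something you eyeball on a dashboard, wrap the query in an alerting rule. A sketch, averaged per instance; the rule name, duration and labels are my own convention:
groups:
  - name: cpu-steal
    rules:
      - alert: HighCpuSteal
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "CPU steal above 5% on {{ $labels.instance }} for 15 minutes"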
Conclusion
Monitoring at scale is about reducing noise and increasing signal. It requires a solid architecture, the right configuration, and hardware that doesn't choke on writes. By building your own stack on compliant, high-performance NVMe infrastructure, you gain total control over your data and your uptime.
Don't wait for the next outage to realize your monitoring is insufficient. Deploy a CoolVDS NVMe instance today, install Prometheus, and finally see what your servers are really doing.