Zero-Blindspot Infrastructure: Scaling Prometheus Monitoring in 2022
It was 3:42 AM on a Tuesday when the pager screamed. Our primary load balancer in Oslo hadn't failed; it had simply stopped processing requests while reporting "healthy" status checks. The CPU load was normal. Memory usage was fine. But the connection table was saturated because of a micro-burst DDoS that our standard sampling interval missed entirely. If you rely on 5-minute averages or simple up/down checks, you are flying blind. In the Nordic hosting market, where reliability is the currency we trade in, that level of blindness is unacceptable. I have spent the last decade debugging distributed systems across Europe, and the lesson is always the same: if you cannot query the state of your infrastructure at a one-second resolution, you do not understand your infrastructure.
The "Four Golden Signals" Are Not Optional
Google's SRE book evangelized the concept of the Four Golden Signals: Latency, Traffic, Errors, and Saturation. However, implementing this on bare metal or VPS infrastructure requires more than just installing an agent. In 2022, the standard stack is Prometheus for scraping and Grafana for visualization. This combination is robust, but it defaults to configurations that will kill your disk I/O as soon as you scale past a few dozen nodes. We need to look at specific configurations to make this viable for production environments handling real traffic.
When you deploy a monitoring stack, you are essentially deploying a write-heavy database. Time Series Databases (TSDBs) like the one inside Prometheus rely heavily on disk performance. I have seen perfectly good monitoring setups crash because the underlying storage couldn't handle the IOPS of writing 50,000 samples per second. This is where hardware selection becomes critical. Using spinning rust (HDDs) or shared, throttled storage for your monitoring instance is a recipe for data gaps. This is why we standardize on NVMe storage for all CoolVDS instances; when you are ingesting metrics from a Kubernetes cluster or a swarm of Nginx nodes, the write latency needs to be negligible.
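Prometheus monitors itself, so you can quantify that write pressure directly. The queries below use standard self-monitoring metrics exposed by any recent Prometheus server; run them in the expression browser to see how hard your TSDB is actually working:

# Samples ingested per second across the whole server
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Active time series currently held in the head block (memory)
prometheus_tsdb_head_series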
1. Configuring the Node Exporter for Real Data
The standard node_exporter is noisy. It collects everything by default. To scale, you need to disable collectors you don't use and enable the ones that actually help debug saturation.
Here is a systemd service file optimized for a high-traffic node. Instead of disabling collectors one by one, it turns everything off with --collector.disable-defaults and re-enables only what we need; noise sources like the wifi and zfs collectors (unless you actually run ZFS) just burn CPU cycles.
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --collector.disable-defaults \
  --collector.cpu \
  --collector.meminfo \
  --collector.loadavg \
  --collector.filesystem \
  --collector.netdev \
  --collector.netstat \
  --collector.diskstats \
  --collector.filefd \
  --web.listen-address=:9100

[Install]
WantedBy=multi-user.target
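Once the unit is running, it is worth checking what each node actually exposes. Prometheus attaches the synthetic scrape_samples_scraped metric to every target, so a quick query (using the job name defined in the next section) shows the per-scrape payload before and after trimming collectors:

# Samples returned per scrape, per target
scrape_samples_scraped{job="coolvds-nodes"}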
By explicitly defining collectors, we reduce the payload size of every scrape. This might seem trivial on one server, but when you are scraping 500 VPS instances across Europe, that bandwidth adds up. Speaking of scraping, let's look at the Prometheus configuration.
Scraping Strategy and Federation
A common mistake is having a single Prometheus server try to scrape the entire world. It works until it runs out of RAM. The 2022 best practice is functional sharding or federation. You should have a Prometheus instance inside your CoolVDS environment in Oslo scraping your Norwegian nodes, and another in Frankfurt for your Central European nodes. You can then use a central Grafana to query both data sources, or a federated Prometheus to aggregate specific high-level metrics.
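For reference, a minimal federation job on the central instance looks roughly like the sketch below. The target address is a placeholder for your Oslo Prometheus, and the match[] selectors assume you only pull pre-aggregated recording rules plus the up metric rather than raw series:

scrape_configs:
  - job_name: 'federate-oslo'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
        - 'up'
    static_configs:
      - targets: ['10.0.1.10:9090']

Pulling raw per-instance series through /federate defeats the purpose of sharding; aggregate first, federate second.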
Pro Tip: Never expose your exporters to the public internet. Use a VPN (WireGuard is excellent and low-overhead) or strict iptables rules. If you host on CoolVDS, utilize the private networking VLANs to keep scrape traffic off your public interface.
Here is a robust prometheus.yml configuration that handles service discovery via a file mechanism (easier to manage with Ansible/Chef than static configs) and sets aggressive scrape intervals for critical jobs.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    region: 'no-oslo-1'
    env: 'production'

scrape_configs:
  - job_name: 'coolvds-nodes'
    scrape_interval: 10s
    scrape_timeout: 5s
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.json'
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):9100'
        target_label: instance
        replacement: '${1}'
      - source_labels: [__meta_filepath]
        regex: '.*targets/(.*).json'
        target_label: service_type
        replacement: '${1}'

  - job_name: 'nginx-ingress'
    scrape_interval: 5s
    static_configs:
      - targets: ['10.0.0.5:9113', '10.0.0.6:9113']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'nginx_(connections_.*|http_requests_total)'
        action: keep
In the config above, notice the scrape_interval: 5s for Nginx. You cannot detect micro-bursts with a 60-second interval. You need high resolution. However, high resolution multiplies the volume of samples hitting the TSDB. This brings us back to storage. If your VPS has low IOPS (common with budget providers overselling HDD arrays), your Prometheus will lag behind real-time, rendering your alerting useless.
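You can catch that lag before it bites you. Prometheus records scrape_duration_seconds for every target; if durations creep toward the 5-second timeout configured above, either the node or the storage underneath Prometheus is struggling:

# Targets whose scrapes are approaching the configured timeout
scrape_duration_seconds{job="coolvds-nodes"} > 4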
2. Application Instrumentation
Infrastructure metrics only tell half the story. You need to know what the application is doing. If you are running Nginx, you must enable the stub_status module to get connection counts.
Add this to your Nginx block:
location /stub_status {
    stub_status on;
    allow 127.0.0.1;
    deny all;
}
Then run the nginx-prometheus-exporter sidecar to translate that into Prometheus format. This gives you visibility into Active connections vs Reading vs Writing. High "Writing" usually indicates slow clients or network latency downstream.
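A minimal systemd unit for the sidecar could look like the sketch below. The binary path is an assumption (adjust it to however you install the exporter), and -nginx.scrape-uri points at the stub_status location defined above:

[Unit]
Description=NGINX Prometheus Exporter
After=network-online.target

[Service]
User=node_exporter
# Path is an assumption; adjust to your install location
ExecStart=/usr/local/bin/nginx-prometheus-exporter \
  -nginx.scrape-uri=http://127.0.0.1/stub_status \
  -web.listen-address=:9113

[Install]
WantedBy=multi-user.target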
Alerting: Reducing Pager Fatigue
Alerting on "CPU > 90%" is a rookie move. A database server might run efficiently at 95% CPU for hours during a backup cycle. You should alert on Error Budgets and Saturation. Use Alertmanager to route these intelligently.
We use PromQL (Prometheus Query Language) to define alerts that actually matter. For example, predicting disk fill-up time rather than just alerting when it's full.
groups:
  - name: node_alerts
    rules:
      - alert: DiskWillFillIn4Hours
        expr: predict_linear(node_filesystem_free_bytes{fstype!="tmpfs"}[1h], 4 * 3600) < 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk is filling up fast on {{ $labels.instance }}"

      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High 5xx error rate detected"
The predict_linear function is powerful. It looks at the trend of the last hour and calculates whether the filesystem will run out of free space within the next 4 hours. This gives you time to react, rather than waking you up when the server has already crashed.
3. Visualizing with Grafana
Data without visualization is just noise. Grafana v8 (current stable) offers powerful visualization panels. When building dashboards for a Norwegian context, consider latency maps. If your users are in Oslo, you want to measure latency from a probe in Oslo, not from a server in Virginia.
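The usual pattern for that is the blackbox_exporter: run a small probe instance in Oslo and have Prometheus ask it to test your endpoints. The sketch below is a scrape job (added under scrape_configs) that assumes a blackbox_exporter listening on 127.0.0.1:9115 with an http_2xx module defined, and a placeholder URL for the site your Norwegian users actually hit:

  - job_name: 'blackbox-oslo'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://www.example.no/']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115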
Here is a snippet of PromQL you would use in a Grafana panel to calculate the 99th percentile of request duration, which is far more useful than the average:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
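Computing that quantile on every dashboard refresh gets expensive once the histogram has months of data behind it. A recording rule evaluates it once per interval and stores the result as a cheap new series; the rule name below follows the usual level:metric:operation convention and is otherwise my own choice:

groups:
  - name: latency_rules
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))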
The Compliance Angle: GDPR and Datatilsynet
We cannot discuss infrastructure in 2022 without addressing the elephant in the room: Schrems II and GDPR. The Norwegian Data Protection Authority (Datatilsynet) has made it clear that transferring personal data to non-adequate third countries is risky. Monitoring data often contains IP addresses, user IDs, or URL parameters that classify as PII (Personally Identifiable Information).
By hosting your monitoring stack on a US-owned SaaS cloud, you might be inadvertently violating data export laws. Hosting your own Prometheus instance on CoolVDS servers physically located in Norway ensures data sovereignty. Your metrics stay within the EEA (European Economic Area), simplifying your compliance posture significantly. The "Pragmatic CTO" knows that the cost of a compliance breach far outweighs the cost of managing a few Linux servers.
4. Securing the Transport
If your scrape traffic has to cross the public internet (e.g., from a branch office to your central CoolVDS instance), you must use TLS. Recent Prometheus versions (2.24 and later) can load a web-config.yml via the --web.config.file flag, and the official exporters support the same mechanism. Here is a basic config to enable TLS:
tls_server_config:
  cert_file: /etc/prometheus/tls/server.crt
  key_file: /etc/prometheus/tls/server.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/tls/ca.crt
This enforces mutual TLS (mTLS). Only clients presenting a certificate signed by your CA can talk to the endpoint: apply the same web config to your exporters so that only your authorized Prometheus servers can scrape them, and keep it on Prometheus itself so that only trusted clients can query the API. It adds complexity, but in a security-conscious environment, it is mandatory.
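On the scraping side, the corresponding job has to present its client certificate. The sketch below mirrors the server paths above; the certificate filenames and the target hostname are assumptions:

  - job_name: 'remote-branch-nodes'
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/tls/ca.crt
      cert_file: /etc/prometheus/tls/client.crt
      key_file: /etc/prometheus/tls/client.key
    static_configs:
      - targets: ['branch-gw.example.com:9100']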
Final Thoughts: Performance is a Feature
Building a monitoring system that can ingest 100,000 samples per second requires respecting the physics of the hardware. CPU steal time in noisy-neighbor environments will cause gaps in your graphs. Slow I/O will cause alerting delays.
At CoolVDS, we don't oversell our cores, and we use enterprise-grade NVMe storage because we know that when you are debugging a production outage, every millisecond of query latency counts. Don't let your monitoring tool be the bottleneck that hides the root cause.
Ready to build a monitoring stack that survives the spike? Deploy a high-memory, NVMe-backed instance on CoolVDS today and start seeing what is actually happening inside your infrastructure.