Surviving Cardinality Hell: A Battle-Tested Guide to Infrastructure Monitoring in 2025
It was 3:47 AM on a Tuesday when my phone vibrated off the nightstand. The alert wasn't for a downed server. It was for the monitoring system itself. Our Prometheus instance had been OOM-killed (Out Of Memory) because a junior dev had deployed a microservice that tagged every HTTP request with a unique `user_id` label.
We hit 4 million active time series in thirty seconds. The scraper choked. Grafana dashboards flatlined. We were flying blind.
If you manage infrastructure at scale, you know this pain. In 2025, deploying a Kubernetes cluster is trivial, but keeping eyes on it without generating terabytes of useless noise is an art form. This isn't a "Getting Started" guide. This is how you engineer observability when your infrastructure spans hundreds of nodes and reliability is non-negotiable.
The Hidden Cost of Metric Cardinality
Most VPS providers won't tell you this, but CPU steal time (noisy neighbors) kills monitoring precision. If your time-series database (TSDB) can't write to disk fast enough because the host node is oversubscribed, you get gaps in your graphs. You think it's a network blip; actually, it's cheap hosting.
When we built the reference architecture for CoolVDS, we enforced strict isolation on NVMe I/O queues specifically to handle heavy write-ahead log (WAL) operations typical in Prometheus or VictoriaMetrics setups. But hardware is only half the battle. You need to fix your configs.
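Before touching configs, confirm whether steal time is actually the culprit. If node_exporter is already running, the data is there; a minimal alerting rule like the sketch below (the 10% threshold, file name, and rule name are my own assumptions) will page you when a noisy neighbor starts eating your CPU:

```yaml
# steal-time.rules.yml (hypothetical file name; load it via rule_files in prometheus.yml)
groups:
  - name: noisy-neighbour
    rules:
      - alert: HighCpuStealTime
        # Fraction of CPU time stolen by the hypervisor, averaged per instance over 5 minutes
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) > 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is losing over 10% of its CPU to the hypervisor"
```

If this fires regularly on your current provider, no amount of relabeling will fix your graphs.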
1. Ruthless Relabeling
The fastest way to kill a Prometheus server is high cardinality—too many unique combinations of label values. You must drop unnecessary labels at the ingestion point.
Here is a snippet from a production `prometheus.yml` configuration used to strip high-cardinality noise from a Kubernetes ingress controller:
```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # DROP specific high-cardinality labels before storage
      - action: labeldrop
        regex: (uid|container_hash|image_sha)
      # KEEP only pods that actually need monitoring
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
    metric_relabel_configs:
      # DANGEROUS: Dropping metrics that consume 80% of storage but provide 0 value
      - source_labels: [__name__]
        regex: 'apiserver_request_duration_seconds_bucket'
        action: drop
```
Pro Tip: Run the following PromQL query to identify which metrics are eating your storage: `topk(10, count by (__name__)({__name__=~".+"}))`. You will likely find that `node_cpu_seconds_total` or similar raw metrics are generating excessive series due to granular per-core reporting. Filter them.
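If `node_cpu_seconds_total` tops that list, remember that you rarely need every CPU mode on every core. A `metric_relabel_configs` fragment like this sketch, added under the node-exporter scrape job (trim the mode list to whatever your dashboards actually use), drops the modes nobody graphs before they ever reach the TSDB:

```yaml
    metric_relabel_configs:
      # Drop CPU modes that are rarely graphed; keep idle, user, system, iowait and steal
      - source_labels: [__name__, mode]
        regex: 'node_cpu_seconds_total;(nice|irq|softirq|guest|guest_nice)'
        action: drop
```

Because it lives under a single scrape job, other exporters are untouched.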
Storage IOPS: The Bottleneck No One Talks About
In 2025, we generate roughly 10x the telemetry data we did five years ago. OpenTelemetry traces, eBPF probes, and standard metrics create a massive write load. If you are running this on a standard HDD or even a SATA-SSD-based VPS, you will eventually see `WAL corruption` errors, usually after the box gets OOM-killed or shut down uncleanly mid-write. It's not if, it's when.
A Time Series Database appends data sequentially but merges files in the background (compaction). This requires high random I/O performance.
| Storage Type | Sequential Write (MB/s) | Random Read (IOPS) | Verdict for Monitoring |
|---|---|---|---|
| Standard HDD | 120 | ~80 | Failure (Will lag behind ingestion) |
| SATA SSD (Shared) | 450 | ~5,000 | Risky (Compaction kills performance) |
| CoolVDS NVMe | 3,500+ | ~50,000+ | Production Ready |
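You don't have to wait for lagging graphs to find out your disk is too slow. Prometheus exposes self-metrics for its own WAL and compaction behaviour; here is a sketch of two rules built on them (the 500 ms threshold is an assumption, tune it to your hardware):

```yaml
groups:
  - name: tsdb-storage-health
    rules:
      - alert: SlowWALFsync
        # Average WAL fsync latency over 5 minutes; on healthy NVMe this stays in the low milliseconds
        expr: rate(prometheus_tsdb_wal_fsync_duration_seconds_sum[5m]) / rate(prometheus_tsdb_wal_fsync_duration_seconds_count[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "WAL fsync is averaging above 500ms; the disk cannot keep up with ingestion"
      - alert: WALCorruptionDetected
        # Any increase in the corruption counter means data was already damaged on disk
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Prometheus detected WAL corruption; check disk health and I/O latency"
```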
Network Latency and the "Oslo Edge"
For Norwegian businesses, the location of your monitoring server is critical. If your infrastructure is in Oslo but your monitoring stack is in Frankfurt, you introduce a ~25ms round-trip delay. For standard HTTP checks, this is negligible. For distributed tracing where you are aggregating spans from microservices, network jitter adds up.
Hosting your monitoring stack on a VPS in Norway connected directly to NIX (Norwegian Internet Exchange) ensures that your "internal" latency checks are accurate. You aren't measuring the internet; you are measuring your application.
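One way to keep yourself honest: once the Blackbox exporter from the next section is probing your endpoints, record the connect and processing phases of each HTTP probe separately. If connect time dominates, you are measuring the network path, not your application. (A sketch; the recording-rule names are my own.)

```yaml
groups:
  - name: latency-breakdown
    rules:
      # TCP handshake time: dominated by network distance between prober and target
      - record: instance:probe_http_connect_seconds:avg5m
        expr: avg by (instance) (avg_over_time(probe_http_duration_seconds{phase="connect"}[5m]))
      # Server processing time: the part you actually want to watch for regressions
      - record: instance:probe_http_processing_seconds:avg5m
        expr: avg by (instance) (avg_over_time(probe_http_duration_seconds{phase="processing"}[5m]))
```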
Implementing a Blackbox Exporter
Don't trust localhost. Use the Blackbox exporter to probe your services from the "outside" (or at least a separate VLAN). Here is a lean Docker Compose base for the core stack, stripping away the bloat; the Blackbox exporter service itself is sketched right after the file:
```yaml
services:
  prometheus:
    image: prom/prometheus:v2.54.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # Critical for keeping memory usage low on high-load instances
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    deploy:
      resources:
        limits:
          memory: 4G

  node-exporter:
    image: prom/node-exporter:v1.8.2
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
```
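The Blackbox exporter itself is a third service plus the standard `/probe` relabeling on the Prometheus side. A minimal sketch, assuming the service is named `blackbox`, the default `http_2xx` module is defined in a local `blackbox.yml`, and the target URL is a placeholder you replace with your own endpoints:

```yaml
# Addition to the Compose file above
  blackbox:
    image: prom/blackbox-exporter:v0.25.0
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
    ports:
      - "9115:9115"
```

```yaml
# Addition to prometheus.yml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://your-app.example/healthz   # placeholder; list the endpoints you care about
    relabel_configs:
      # The real target becomes a query parameter; Prometheus scrapes the exporter itself
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```

With this in place, `probe_success` and `probe_duration_seconds` appear per target, which is exactly what the latency-breakdown rules from the previous section consume.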
Compliance: The GDPR Elephant
Since Schrems II and Datatilsynet's subsequent tightening of enforcement, exporting log data containing PII (Personally Identifiable Information) outside the EEA, or even outside Norway for stricter sectors, is a legal minefield. Access logs often contain IP addresses. If you ship those logs to a US-cloud-based SaaS monitoring tool, you are technically exporting personal data.
Self-hosting your stack on CoolVDS keeps the data strictly within Norwegian jurisdiction. You own the disk, you own the encryption keys, and the data never leaves the Oslo data center unless you say so. This simplifies your ROPA (Record of Processing Activities) significantly.
The Architecture of Resilience
Reliability isn't about buying the most expensive tool; it's about reducing complexity. A single, well-tuned Prometheus instance running on a high-performance NVMe VPS often outperforms a complex, federated setup that no one on your team knows how to debug.
If you are tired of wondering why your dashboards are lagging or why your I/O wait is spiking during query execution, it's time to look at the metal underneath your metrics.
Don't let slow I/O kill your observability. Deploy a high-frequency NVMe instance on CoolVDS in 55 seconds and see what you've been missing.