Surviving Cardinality Hell: A Battle-Tested Guide to Infrastructure Monitoring in 2025
It was 3:47 AM on a Tuesday when my phone vibrated off the nightstand. The alert wasn't for a downed server. It was for the monitoring system itself. Our Prometheus instance had been OOM-killed (Out Of Memory) because a junior dev had deployed a microservice that tagged every HTTP request with a unique `user_id` label.
We hit 4 million active time series in thirty seconds. The scraper choked. Grafana dashboards flatlined. We were flying blind.
If you manage infrastructure at scale, you know this pain. In 2025, deploying a Kubernetes cluster is trivial, but keeping eyes on it without generating terabytes of useless noise is an art form. This isn't a "Getting Started" guide. This is how you engineer observability when your infrastructure spans hundreds of nodes and reliability is non-negotiable.
The Hidden Cost of Metric Cardinality
Most VPS providers won't tell you this, but CPU steal time (noisy neighbors) kills monitoring precision. If your time-series database (TSDB) can't write to disk fast enough because the host node is oversubscribed, you get gaps in your graphs. You think it's a network blip; actually, it's cheap hosting.
When we built the reference architecture for CoolVDS, we enforced strict isolation on NVMe I/O queues specifically to handle heavy write-ahead log (WAL) operations typical in Prometheus or VictoriaMetrics setups. But hardware is only half the battle. You need to fix your configs.
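Before touching configs, confirm whether steal time is actually the culprit. If node_exporter is already running, the data is there; a minimal alerting rule like the sketch below (the 10% threshold, file name, and rule name are my own assumptions) will page you when a noisy neighbor starts eating your CPU:

```yaml
# steal-time.rules.yml (hypothetical file name; load it via rule_files in prometheus.yml)
groups:
  - name: noisy-neighbour
    rules:
      - alert: HighCpuStealTime
        # Fraction of CPU time stolen by the hypervisor, averaged per instance over 5 minutes
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) > 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is losing over 10% of its CPU to the hypervisor"
```

If this fires regularly on your current provider, no amount of relabeling will fix your graphs.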
1. Ruthless Relabeling
The fastest way to kill a Prometheus server is high cardinality—too many unique combinations of label values. You must drop unnecessary labels at the ingestion point.
Here is a snippet from a production `prometheus.yml` configuration used to strip high-cardinality noise from a Kubernetes ingress controller:
```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # DROP specific high-cardinality labels before storage
      - action: labeldrop
        regex: (uid|container_hash|image_sha)
      # KEEP only pods that actually need monitoring
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
    metric_relabel_configs:
      # DANGEROUS: Dropping metrics that consume 80% of storage but provide 0 value
      - source_labels: [__name__]
        regex: 'apiserver_request_duration_seconds_bucket'
        action: drop
```
Pro Tip: Run the following PromQL query to identify which metrics are eating your storage: `topk(10, count by (__name__)({__name__=~".+"}))`. You will likely find that `node_cpu_seconds_total` or similar raw metrics are generating excessive series due to granular per-core reporting. Filter them.
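If `node_cpu_seconds_total` tops that list, remember that you rarely need every CPU mode on every core. A `metric_relabel_configs` fragment like this sketch, added under the node-exporter scrape job (trim the mode list to whatever your dashboards actually use), drops the modes nobody graphs before they ever reach the TSDB:

```yaml
    metric_relabel_configs:
      # Drop CPU modes that are rarely graphed; keep idle, user, system, iowait and steal
      - source_labels: [__name__, mode]
        regex: 'node_cpu_seconds_total;(nice|irq|softirq|guest|guest_nice)'
        action: drop
```

Because it lives under a single scrape job, other exporters are untouched.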
Storage IOPS: The Bottleneck No One Talks About
In 2025, we generate roughly 10x the telemetry data we did five years ago. OpenTelemetry traces, eBPF probes, and standard metrics create a massive write load. If you are running this on a standard HDD or even a SATA-SSD-based VPS, you will eventually see `WAL corruption` errors, usually after the box gets OOM-killed or shut down uncleanly mid-write. It's not if, it's when.
A Time Series Database appends data sequentially but merges files in the background (compaction). This requires high random I/O performance.
| Storage Type | Sequential Write (MB/s) | Random Read (IOPS) | Verdict for Monitoring |
|---|---|---|---|
| Standard HDD | 120 | ~80 | Failure (Will lag behind ingestion) |
| SATA SSD (Shared) | 450 | ~5,000 | Risky (Compaction kills performance) |
| CoolVDS NVMe | 3,500+ | ~50,000+ | Production Ready |
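You don't have to wait for lagging graphs to find out your disk is too slow. Prometheus exposes self-metrics for its own WAL and compaction behaviour; here is a sketch of two rules built on them (the 500 ms threshold is an assumption, tune it to your hardware):

```yaml
groups:
  - name: tsdb-storage-health
    rules:
      - alert: SlowWALFsync
        # Average WAL fsync latency over 5 minutes; on healthy NVMe this stays in the low milliseconds
        expr: rate(prometheus_tsdb_wal_fsync_duration_seconds_sum[5m]) / rate(prometheus_tsdb_wal_fsync_duration_seconds_count[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "WAL fsync is averaging above 500ms; the disk cannot keep up with ingestion"
      - alert: WALCorruptionDetected
        # Any increase in the corruption counter means data was already damaged on disk
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Prometheus detected WAL corruption; check disk health and I/O latency"
```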
Network Latency and the "Oslo Edge"
For Norwegian businesses, the location of your monitoring server is critical. If your infrastructure is in Oslo but your monitoring stack is in Frankfurt, you introduce a ~25ms round-trip delay. For standard HTTP checks, this is negligible. For distributed tracing where you are aggregating spans from microservices, network jitter adds up.
Hosting your monitoring stack on a VPS in Norway connected directly to NIX (Norwegian Internet Exchange) ensures that your "internal" latency checks are accurate. You aren't measuring the internet; you are measuring your application.
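One way to keep yourself honest: once the Blackbox exporter from the next section is probing your endpoints, record the connect and processing phases of each HTTP probe separately. If connect time dominates, you are measuring the network path, not your application. (A sketch; the recording-rule names are my own.)

```yaml
groups:
  - name: latency-breakdown
    rules:
      # TCP handshake time: dominated by network distance between prober and target
      - record: instance:probe_http_connect_seconds:avg5m
        expr: avg by (instance) (avg_over_time(probe_http_duration_seconds{phase="connect"}[5m]))
      # Server processing time: the part you actually want to watch for regressions
      - record: instance:probe_http_processing_seconds:avg5m
        expr: avg by (instance) (avg_over_time(probe_http_duration_seconds{phase="processing"}[5m]))
```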
Implementing a Blackbox Exporter
Don't trust localhost. Use the Blackbox exporter to probe your services from the "outside" (or at least a separate VLAN). Here is a lean Docker Compose base for the core stack, stripping away the bloat; the Blackbox exporter service itself is sketched right after the file:
```yaml
services:
  prometheus:
    image: prom/prometheus:v2.54.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      # Critical for keeping memory usage low on high-load instances
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    deploy:
      resources:
        limits:
          memory: 4G

  node-exporter:
    image: prom/node-exporter:v1.8.2
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
```
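The Blackbox exporter itself is a third service plus the standard `/probe` relabeling on the Prometheus side. A minimal sketch, assuming the service is named `blackbox`, the default `http_2xx` module is defined in a local `blackbox.yml`, and the target URL is a placeholder you replace with your own endpoints:

```yaml
# Addition to the Compose file above
  blackbox:
    image: prom/blackbox-exporter:v0.25.0
    volumes:
      - ./blackbox.yml:/etc/blackbox_exporter/config.yml
    ports:
      - "9115:9115"
```

```yaml
# Addition to prometheus.yml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://your-app.example/healthz   # placeholder; list the endpoints you care about
    relabel_configs:
      # The real target becomes a query parameter; Prometheus scrapes the exporter itself
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```

With this in place, `probe_success` and `probe_duration_seconds` appear per target, which is exactly what the latency-breakdown rules from the previous section consume.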
Compliance: The GDPR Elephant
Since Schrems II and Datatilsynet's subsequent tightening of enforcement, exporting log data containing PII (Personally Identifiable Information) outside the EEA, or even outside Norway for stricter sectors, is a legal minefield. Access logs often contain IP addresses. If you ship those logs to a US-cloud-based SaaS monitoring tool, you are technically exporting personal data.
Self-hosting your stack on CoolVDS keeps the data strictly within Norwegian jurisdiction. You own the disk, you own the encryption keys, and the data never leaves the Oslo data center unless you say so. This simplifies your ROPA (Record of Processing Activities) significantly.
The Architecture of Resilience
Reliability isn't about buying the most expensive tool; it's about reducing complexity. A single, well-tuned Prometheus instance running on a high-performance NVMe VPS often outperforms a complex, federated setup that no one on your team knows how to debug.
If you are tired of wondering why your dashboards are lagging or why your I/O wait is spiking during query execution, it's time to look at the metal underneath your metrics.
Don't let slow I/O kill your observability. Deploy a high-frequency NVMe instance on CoolVDS in 55 seconds and see what you've been missing.