Scaling Infrastructure Monitoring: Surviving the Metrics Explosion in 2021
There is nothing quite like the silence of a catastrophic failure. The dashboard is green. The alerts are silent. Yet customers are screaming on Twitter because the checkout page is timing out. The opposite failure mode is just as corrosive: the "boy who cried wolf" scenario, with PagerDuty firing at 03:14 because a backup job spiked CPU usage for 30 seconds.
If you are managing infrastructure in 2021, you are likely drowning in data. We moved from monoliths to microservices, and suddenly we have 50 containers where we used to have one server. The result? A cardinality explosion that creates more noise than signal.
I have spent the last decade architecting systems across the Nordics. I've seen monitoring stacks that cost more than the production infrastructure they were watching. Today, we are going to fix that. We will look at how to build a monitoring architecture that scales, respects Norwegian data sovereignty (Schrems II is not a suggestion), and actually helps you sleep.
The Cardinality Trap: Just Because You Can, Doesn't Mean You Should
The most common mistake I see in `prometheus.yml` files is the "inhale everything" approach. When you automatically scrape every pod and endpoint, you ingest high-cardinality labels—user IDs, request IDs, or dynamic pod names that change every deploy.
This kills your time-series database (TSDB). Query performance tanks. Grafana dashboards time out.
Pro Tip: In Prometheus, use `metric_relabel_configs` to drop high-cardinality labels before ingestion. If a label value is unique to a single request, it does not belong in your metrics. It belongs in your logs.
Optimizing the Scrape Config
Here is a battle-tested configuration snippet we use to sanitize metrics coming from our Kubernetes 1.19 clusters. It drops the noisy `pod_template_hash` and `controller_revision_hash` labels, which add zero value to historical trending.
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
    metric_relabel_configs:
      # Drop high-cardinality labels
      - regex: 'controller_revision_hash|pod_template_hash'
        action: labeldrop
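Before deciding what to drop, find out which metrics are actually inflating your series count. Here is a minimal sketch, assuming Prometheus 2.15 or newer listening on localhost:9090 and `jq` installed, that queries the built-in TSDB status endpoint:

#!/bin/bash
# Show the ten metric names holding the most series in the TSDB head block.
# Assumes Prometheus >= 2.15 on localhost:9090 and jq installed.
curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"'

Any single metric name carrying tens of thousands of series is a candidate for relabelling, recording rules, or dropping outright.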
Data Sovereignty: The "Schrems II" Reality Check
July 2020 changed everything for us in Europe. The CJEU's Schrems II ruling effectively invalidated the Privacy Shield. If you are piping your server logs and system metrics—which often contain IP addresses or user metadata—to a US-based SaaS monitoring provider, you are walking a legal tightrope.
Datatilsynet (The Norwegian Data Protection Authority) has been clear: you are responsible for where your data flows. This is why hosting your monitoring stack on domestic infrastructure isn't just about latency; it's about compliance.
By running your Prometheus and ELK (Elasticsearch, Logstash, Kibana) stack on a VPS Norway instance, you keep data within the jurisdiction. You also benefit from direct peering at NIX (Norwegian Internet Exchange) in Oslo. If your infrastructure is in Oslo and your monitoring SaaS is in Virginia (us-east-1), you are introducing 80ms+ of latency just to check if your server is alive. That is unacceptable for high-frequency trading or real-time gaming backends.
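You don't have to take the latency argument on faith; measure it from the servers you actually run. A quick sketch with curl, where both hostnames are placeholders for your own Oslo-hosted and US-hosted monitoring endpoints:

#!/bin/bash
# Compare TCP connect times to two hypothetical monitoring endpoints.
# Replace the URLs with your real Oslo-based and US-based targets.
for endpoint in https://metrics.example.no/-/healthy \
                https://metrics.example-us.com/-/healthy; do
  t=$(curl -s -o /dev/null -w '%{time_connect}' "$endpoint")
  echo "TCP connect to ${endpoint}: ${t}s"
done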
The Hardware Beneath the Software
You can tune software all day, but if your underlying storage I/O is garbage, your ELK stack will crawl. Elasticsearch is notoriously I/O hungry. Indexing thousands of log lines per second requires random write speeds that spinning rust (HDD) or cheap SATA SSDs simply cannot handle.
We see this constantly: a DevOps engineer complains that Grafana is slow. They blame the query. In reality, `iowait` on the monitoring server is hitting 40%.
This is where NVMe storage becomes non-negotiable. At CoolVDS, we don't upsell NVMe as a "premium" feature; it's the baseline. When ingestion spikes during a DDoS attack or a Black Friday sale, NVMe queues handle the load without blocking the CPU.
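Don't take anyone's word for it, including ours; benchmark the volume before you put an Elasticsearch hot tier on it. A rough sketch with fio approximating small random writes, where the target directory and sizes are assumptions to adapt:

#!/bin/bash
# 4k random-write benchmark, a rough stand-in for Elasticsearch indexing pressure.
# Requires fio. Point --directory at the volume that will hold your indices.
fio --name=es-randwrite \
    --directory=/var/lib/elasticsearch \
    --rw=randwrite --bs=4k --iodepth=64 --numjobs=4 \
    --size=1G --runtime=60 --time_based \
    --ioengine=libaio --direct=1 --group_reporting

Healthy NVMe typically sustains tens of thousands of 4k write IOPS in this test; if you are seeing low single-digit thousands, the disk, not the JVM heap, is what is slowing your stack down.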
Diagnosing "Noisy Neighbors" and CPU Steal
If you are on a budget VPS provider, your monitoring might be lying to you. When the host node is overcommitted, your VM waits for CPU cycles, and that wait shows up as steal time (`st` in `top`).
Use this simple check to see if your current host is stealing your performance:
#!/bin/bash
# Check for CPU steal time > 0.5% (requires the sysstat package and bc).
# The second iostat report reflects current load; the first is the average
# since boot. LC_ALL=C forces dot-decimal output so bc can parse it.
STEAL=$(LC_ALL=C iostat -c 1 2 | awk '/^[[:space:]]*[0-9]/ {steal=$5} END {print steal}')
LIMIT=0.5

if (( $(echo "$STEAL > $LIMIT" | bc -l) )); then
    echo "CRITICAL: High CPU steal time detected: ${STEAL}%"
    echo "Your noisy neighbors are killing your latency."
else
    echo "OK: Steal time is acceptable: ${STEAL}%"
fi
If you see double-digit steal time, no amount of code optimization will save you. You need to migrate to a provider with strict resource isolation.
Smart Alerting: The "3 AM Test"
Alertmanager is powerful, but dangerous. The default configuration often sends an email for every single firing alert. If a cluster of 50 nodes goes down, you don't want 50 emails. You want one email saying "Cluster Down".
We use `group_wait` and `group_interval` to dampen the noise. They force Alertmanager to hold a new notification briefly, wait for related alerts to arrive, and bundle them into a single message.
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # Initial wait to buffer related alerts together
  group_interval: 5m     # Wait before sending an updated notification for the group
  repeat_interval: 4h    # Don't spam if it's not fixed yet
  receiver: 'pagerduty-critical'
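Routing trees are easy to get subtly wrong, so verify them before an incident does. A small sketch using amtool, which ships with Alertmanager; the config path and label values below are assumptions:

#!/bin/bash
# Validate the Alertmanager config, then preview where a sample alert would route.
# Assumes a recent amtool and a config at /etc/alertmanager/alertmanager.yml.
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes test \
  --config.file=/etc/alertmanager/alertmanager.yml \
  alertname=InstanceDown cluster=oslo-prod service=checkout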
Visualizing Throughput: Nginx & MySQL
Finally, let's talk about what to put on the screen. For a standard LAMP/LEMP stack, you need to correlate web server throughput with database latency. If Nginx requests go up and MySQL latency stays flat, you are scaling well. If both spike, you have a bottleneck.
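On the Nginx side, the stub_status module exposes the raw connection and request counters that the usual exporters scrape. A minimal sketch, assuming stub_status is enabled at http://localhost/stub_status:

#!/bin/bash
# Estimate Nginx request throughput over a 10-second window using stub_status.
# Assumes 'stub_status;' is configured at http://localhost/stub_status.
requests() { curl -s http://localhost/stub_status | awk 'NR==3 {print $3}'; }
START=$(requests)
sleep 10
END=$(requests)
echo "Requests per second (10s average): $(( (END - START) / 10 ))"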
Here is a query for checking the InnoDB buffer pool hit rate, a critical metric for database performance on managed hosting environments. On MySQL 5.7 and later the relevant counters live in performance_schema.global_status.
-- Counters live in performance_schema.global_status on MySQL 5.7+.
SELECT
  ROUND((1 - reads.VARIABLE_VALUE / requests.VARIABLE_VALUE) * 100, 2)
    AS buffer_pool_hit_rate
FROM
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
     WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') AS reads,
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
     WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') AS requests;
If the hit rate is below 99% and you have RAM to spare, increase `innodb_buffer_pool_size` immediately.
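On MySQL 5.7.5 and later the buffer pool can be resized online, so no restart is needed. A hedged sketch where the 8 GiB value is only an example; size it to your workload and persist the setting in my.cnf as well:

#!/bin/bash
# Resize the InnoDB buffer pool online (MySQL >= 5.7.5). 8 GiB is an example value.
# Assumes client credentials are available, e.g. via ~/.my.cnf.
mysql -e "SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status';"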
Conclusion
Building a monitoring stack in 2021 requires balancing technical depth with legal compliance. You cannot ignore Schrems II, and you cannot ignore the physics of hardware. The best monitoring software in the world is useless if it runs on a choked VPS with high latency.
Stability starts at the metal. If you are ready to stop fighting with resource contention and start monitoring with precision, you need infrastructure that respects your engineering standards.
Deploy a compliant, high-performance monitoring node in Oslo today. Spin up a CoolVDS NVMe instance in under 55 seconds and see what you've been missing.