Surviving the Data Exodus: Self-Hosting High-Scale Monitoring in a Post-Schrems II World

It is 3:00 AM. Your phone buzzes. It's not a text from a friend; it's PagerDuty. But the alert isn't telling you the server is down—it’s telling you that your US-based monitoring SaaS is lagging by 15 minutes because of a transatlantic routing hiccup. You are flying blind.

To make matters worse, since the CJEU invalidated the Privacy Shield in the Schrems II ruling last month (July 2020), sending your server logs and IP addresses to US-hosted monitoring clouds is now a legal minefield. If you are operating in Norway or the wider EEA, the days of blindly piping /var/log to a third party across the pond are effectively over. The Datatilsynet (Norwegian Data Protection Authority) is not known for its leniency regarding data transfers.

We need to bring the data home. And we need to do it without crashing the disks.

The Architecture of Sovereignty: Prometheus & Grafana

In the current landscape of August 2020, the only viable answer for a DevOps engineer who cares about both latency and legality is a self-hosted Time Series Database (TSDB). Prometheus has won the war against Nagios for metric collection, and Grafana is the undisputed king of visualization.

But here is the trap: TSDBs are I/O vampires. I have seen decent sysadmins spin up a Prometheus instance on a cheap VPS with standard SSDs (or worse, spinning rust), only to have the system choke when ingestion rates hit 50k samples per second. The CPU waits, the queue fills, and you lose data.

Pro Tip: Never underestimate the iowait metric. If your monitoring server shows > 10% iowait, your storage backend is the bottleneck. This is why at CoolVDS, we strictly use NVMe storage for our KVM instances. High IOPS aren't a luxury for monitoring; they are a requirement.
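
Once node_exporter (set up below) is reporting in, you can watch this directly from Prometheus. A minimal PromQL sketch, assuming the stock node_cpu_seconds_total metric:

# Percentage of CPU time spent waiting on I/O, per instance, over the last 5 minutes
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))

If that graph sits persistently above 10, the disk is the problem, not the CPU.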

Step 1: The Foundation

We are going to deploy a Prometheus v2.20 instance using Docker. While Kubernetes is great, for a dedicated monitoring node, a clean Docker Compose setup often reduces overhead and complexity. Ensure your host machine is running a recent kernel (5.4+) for better overlayfs performance.
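
A quick sanity check before going further (this assumes Docker is already installed on the host):

# Kernel version: aim for 5.4 or newer
uname -r

# Docker should be using the overlay2 storage driver
docker info 2>/dev/null | grep -i 'storage driver'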

First, verify your disk I/O. Do not proceed if the box cannot sustain a healthy 4k random-write rate.

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

If you aren't seeing at least 15k IOPS, stop. You need better hardware. On a CoolVDS NVMe slice, you should breeze through this.

Step 2: Configuration for Scale

The default Prometheus configuration is too polite. We need to tune retention and the write-ahead log. Here is a production-grade docker-compose.yml tailored for 2020 standards:

version: '3.7'

services:
  prometheus:
    image: prom/prometheus:v2.20.1
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.wal-compression'  # Crucial for saving space
      - '--web.enable-lifecycle'  # allows config reloads via HTTP POST to /-/reload
    ports:
      - 9090:9090
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.0.1
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:

Note the --storage.tsdb.wal-compression flag. WAL compression was introduced in Prometheus 2.11 and is enabled by default as of 2.20, but spelling it out documents the intent; it significantly reduces the disk footprint of the Write-Ahead Log. Essential when you are paying for premium storage.
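
The compose file above only covers collection. For the Grafana half of the stack mentioned earlier, a service block along these lines slots into the same file; the image tag and admin password are placeholders, and you will need a matching grafana_data entry under the top-level volumes: key:

  grafana:
    image: grafana/grafana:7.1.0   # placeholder tag; pin whichever 7.x release you have validated
    container_name: grafana
    restart: unless-stopped
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme   # placeholder; inject a real secret in production
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - 3000:3000
    networks:
      - monitoring

Point its Prometheus data source at http://prometheus:9090; the service name resolves over the shared monitoring network.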

Step 3: Scraping the Fleet

Your prometheus.yml needs to be smart. Don't just hardcode IPs and walk away. For a small, stable fleet, a labelled static_configs block like the one below is fine; if you are running a dynamic environment, use file_sd_configs for service discovery instead (a variant is shown right after the snippet). Here is a robust starting config:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds-nodes'
    scrape_interval: 10s
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
        labels:
          region: 'no-oslo-1'
          env: 'production'
  
  - job_name: 'postgres'
    static_configs:
      - targets: ['10.0.0.8:9187']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'pg_stat_database_.*'
        action: keep
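
When the fleet stops being static, the same node job can read its targets from disk instead, as mentioned above. A sketch, assuming a /etc/prometheus/targets/ directory is mounted into the container; Prometheus re-reads the files on its own, no restart required:

  - job_name: 'coolvds-nodes-dynamic'
    scrape_interval: 10s
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 1m

The target files themselves are plain YAML lists that your provisioning tooling can regenerate at will, for example:

# /etc/prometheus/targets/oslo.yml (same two nodes as above, now managed outside prometheus.yml)
- targets: ['10.0.0.5:9100', '10.0.0.6:9100']
  labels:
    region: 'no-oslo-1'
    env: 'production'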

The Latency Argument

Why host this in Norway? Aside from the GDPR/Schrems II headache, it is about physics. If your infrastructure is in Oslo, and your monitoring is in Virginia (us-east-1), you are dealing with 80ms+ latency purely on the network round trip. When you are tracing a microservice bottleneck, that latency masks the real spikes.

Hosting locally on a provider like CoolVDS, connected directly to NIX (Norwegian Internet Exchange), ensures your monitoring probe latency is sub-millisecond relative to your workload.

SaaS vs. Self-Hosted (Post-2020 Reality)

Feature               | US-Based SaaS Monitoring     | Self-Hosted (CoolVDS)
Data Sovereignty      | High Risk (Schrems II)       | 100% Compliant
Data Ingestion Cost   | $$$ per GB/metric            | Fixed Hardware Cost
Retention             | Often limited (14 days)      | Disk dependent (Months/Years)
Customization         | Vendor Locked                | Infinite (Open Source)

Detecting the "Silent Killers"

CPU usage is a vanity metric. The real killers in 2020 are entropy starvation and open file descriptors. Here is how you catch a file descriptor leak before it crashes your Nginx server.

Add this alert rule to your Prometheus alerts.yml (and make sure that file is listed under rule_files: in prometheus.yml, or it will never load):

groups:
- name: system_limits
  rules:
  - alert: FdExhaustionClose
    expr: node_filefd_allocated / node_filefd_maximum * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "File descriptor limit reaching critical levels on {{ $labels.instance }}"
      description: "allocated file descriptors > 80% (current: {{ $value }})"

This simple rule has saved more production environments than I can count. When a Java application starts leaking handles, this will ping you days before the crash.
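
The other silent killer mentioned above, entropy starvation, is caught with the same pattern. A sketch that appends to the system_limits group, using node_exporter's node_entropy_available_bits gauge (the 200-bit threshold is a conservative starting point, not gospel):

  - alert: LowEntropy
    expr: node_entropy_available_bits < 200
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Kernel entropy pool nearly empty on {{ $labels.instance }}"
      description: "Available entropy below 200 bits (current: {{ $value }}). TLS handshakes and key generation may start to block."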

The Hardware Truth

You cannot cheat physics, and you cannot cheat the database. TSDBs rely on merging data blocks on disk. This process, compaction, is write-intensive. On a standard HDD or a shared SATA SSD, compaction causes "stalls"—gaps in your graphs where the server was too busy writing old data to accept new data.
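
Prometheus instruments its own compaction, so you can see the stalls coming instead of discovering them as gaps. A sketch using the TSDB self-metrics (names as exposed by the 2.x series; confirm against your instance's /metrics endpoint):

# Average compaction duration over the last hour; a steady climb means the disk is struggling
rate(prometheus_tsdb_compaction_duration_seconds_sum[1h])
  / rate(prometheus_tsdb_compaction_duration_seconds_count[1h])

# Compactions that failed outright
increase(prometheus_tsdb_compactions_failed_total[1h])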

This is where infrastructure choice becomes architectural, not just financial. CoolVDS instances are backed by enterprise NVMe arrays. We don't throttle IOPS to force you to upgrade. When your Prometheus compaction kicks in at 02:00 AM, the NVMe absorbs the hit, and your graphs stay seamless.

In a world where data transfer legality is crumbling and uptime expectations are rising, owning your stack is the only path forward. Don't let a lawyer in Brussels or a cable cut in the Atlantic dictate your visibility.

Ready to bring your metrics home? Spin up a high-performance NVMe instance on CoolVDS today and secure your data within Norwegian borders.