Infrastructure Monitoring at Scale: From Alert Fatigue to Observable Sanity

It’s 3:14 AM. Your phone buzzes. PagerDuty is screaming about high CPU usage on db-node-04. You stumble out of bed, SSH in, run htop, and see... nothing. The spike lasted 45 seconds and vanished. You go back to sleep, only to be woken up again at 4:00 AM. This is not DevOps; this is torture.

If you are managing anything beyond a handful of servers, "looking at logs" doesn't cut it. You need observability. But most setups are built wrong. They collect too much garbage data, store it on slow rotating disks, and alert on symptoms rather than causes. Today, we are tearing down a proper monitoring architecture using the Prometheus stack, tailored for high-performance environments like those we host at CoolVDS.

The Storage Bottleneck No One Talks About

Everyone focuses on the CPU overhead of monitoring agents. That's rarely the problem. The real killer is I/O.

Time Series Databases (TSDBs) like Prometheus rely heavily on write speed. When you are scraping thousands of metrics per second from hundreds of containers, your disk is hammered with append-only writes. If your storage subsystem stalls, gaps appear in your metrics, and you lose visibility exactly when you need it most: during high-load events.

I recall a project last year migrating a Magento cluster. The client hosted their monitoring stack on a budget provider's "standard SSD" tier. During a load test, the monitoring server's I/O wait (iowait) hit 40%. The metrics didn't just lag; they dropped. We were flying blind.
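
You can watch for this on the monitoring host itself. A simple PromQL query against node_exporter's standard CPU metric (a sketch; adjust the range and threshold to your environment) shows I/O wait per instance:

avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))

As a rough rule of thumb, anything consistently above 0.05 to 0.10 on the Prometheus host itself deserves attention before it turns into dropped scrapes.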

Pro Tip: Never put your TSDB on shared spinning rust or throttled SATA SSDs. Prometheus compaction processes require high IOPS. This is why standard CoolVDS instances are provisioned with NVMe storage. It’s not a luxury; for monitoring, it’s a requirement.

The Stack: Prometheus + Node Exporter + Grafana

We are sticking to the industry standard for 2022. No proprietary SaaS agents sending your data to the US (we will get to the legal reasons for that later).

1. Deploying the Scraper

Let's set up a Prometheus instance using Docker Compose. This ensures your monitoring infrastructure is portable and version-controlled.

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.35.0
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    ports:
      - 9090:9090
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.3.1
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - 9100:9100
    restart: unless-stopped
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:

Note the prometheus_data volume. On a CoolVDS instance, this maps directly to NVMe blocks, ensuring that the Write-Ahead Log (WAL) doesn't become a bottleneck.
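
The heading above promises Grafana as well, and the scrape configuration in the next step points alerts at an Alertmanager instance, so in practice you would add two more entries under services: in the same file. A minimal sketch, with image tags and volume names that are illustrative rather than prescriptive:

  grafana:
    image: grafana/grafana:8.5.2    # dashboards, reachable on port 3000
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - 3000:3000
    restart: unless-stopped
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.24.0
    container_name: alertmanager
    volumes:
      # the route/receiver config shown later in this post
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - 9093:9093
    restart: unless-stopped
    networks:
      - monitoring

Remember to declare grafana_data under the top-level volumes: block alongside prometheus_data.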

2. Configuring the Scrape Targets

The prometheus.yml needs to be smart. We don't want to manually edit this file every time we spin up a new VPS. In a real environment, you would use file_sd_configs or a service discovery mechanism (Consul, EC2, or Kubernetes SD).

Here is a robust configuration for a static environment:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodes'
    scrape_interval: 10s
    static_configs:
      - targets: 
        - '10.0.0.5:9100'  # Database Master
        - '10.0.0.6:9100'  # App Server 1
        - '10.0.0.7:9100'  # App Server 2
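
If you want the file-based discovery mentioned above instead of hard-coded IPs, a job like the following (paths and labels are just examples) goes under scrape_configs and tells Prometheus to watch a directory of target files:

  - job_name: 'nodes-file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets.d/*.yml'
        refresh_interval: 1m

Each file in that directory is a plain list of targets, and Prometheus picks up changes without a restart:

- targets:
    - '10.0.0.8:9100'
  labels:
    role: 'app'
    env: 'production'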

The "Golden Signals" of Linux Monitoring

Don't just alert on CPU > 90%. A system can run at 100% CPU quite happily as long as the run queue isn't backing up and tasks aren't stuck waiting on I/O; high utilization only becomes a problem when work starts queuing. Instead, monitor these:

Metric     | Why it matters                          | PromQL Example
Saturation | Is the resource full? (e.g., Disk I/O)  | rate(node_disk_io_time_seconds_total[1m])
Traffic    | Network demand.                         | rate(node_network_receive_bytes_total[1m])
Errors     | Kernel or application failures.         | rate(node_vmstat_pgmajfault[1m])

node_vmstat_pgmajfault counts major page faults, a classic performance killer: it means your application is trying to read memory that has been swapped out to disk. On slow disks, this freezes the app. On CoolVDS NVMe storage the penalty is smaller, but a rising fault rate still tells you to tune your innodb_buffer_pool_size or upgrade RAM.
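
To turn the CPU advice above into a concrete expression, compare the load average against the core count instead of raw utilization. A PromQL sketch, assuming node_exporter's default metrics (the 1.5 threshold is an arbitrary starting point):

node_load5 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 1.5

This divides the 5-minute load average by the number of cores on each instance, so it only flags hosts where tasks are genuinely queuing, not hosts that are merely busy.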

Data Sovereignty: The Norwegian Advantage

Since the Schrems II ruling in 2020, sending IP addresses and system data to US-controlled clouds is a legal minefield. The Norwegian Data Protection Authority (Datatilsynet) is increasingly strict. Just recently, the use of Google Analytics has come under heavy fire across Europe.

Your monitoring data contains IP addresses, hostnames, and potentially sensitive error logs. Hosting this stack on a US-owned hyperscaler exposes you to the CLOUD Act. By running your Prometheus stack on CoolVDS in our Oslo data center, you keep your infrastructure metadata within Norwegian legal jurisdiction. It's compliant, it's safe, and frankly, the latency is better.

Network Latency and False Positives

If your monitoring server is in Frankfurt and your servers are in Oslo, a minor fiber cut in Denmark can trigger false "Host Down" alerts. Keep your monitoring close to your workload.

Latency check from a CoolVDS node in Oslo to NIX (Norwegian Internet Exchange):

root@oslo-monitor-01:~# mtr -r -c 10 193.156.90.1
Start: 2022-05-09T14:22:11+0200
HOST: oslo-monitor-01           Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- gateway.coolvds.net        0.0%    10    0.3   0.3   0.2   0.4   0.1
  2.|-- nix-peering.coolvds.net    0.0%    10    0.8   0.9   0.8   1.1   0.1
  3.|-- 193.156.90.1               0.0%    10    1.1   1.2   1.0   1.4   0.1

1.2ms average latency. This ensures that when an alert fires, it’s because the server is actually down, not because of internet weather.

Alerting That Doesn't Suck

Finally, configure alertmanager to group alerts. If a switch dies, you don't want 50 emails for 50 servers. You want one email saying "Rack 4 is unreachable."

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX'
    channel: '#ops-alerts'
    send_resolved: true

This configuration groups related alerts together and waits 30 seconds before sending the first notification for a group, which absorbs most flapping, while repeat_interval stops it from nagging you about the same unresolved problem every few minutes.
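
The route above only decides how notifications are delivered; the actual conditions live in the alert_rules.yml file referenced in prometheus.yml earlier. A couple of starter rules as a sketch (thresholds and durations are examples, tune them to your environment):

groups:
  - name: node_alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 2 minutes"
      - alert: HighIOWait
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is spending over 20% of CPU time waiting on I/O"

The for: clause is what keeps the 45-second CPU spike from the intro out of your inbox: the condition has to hold for the full duration before Alertmanager ever sees it.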

Conclusion

Monitoring isn't just about pretty Grafana dashboards; it's about reliable data ingestion and actionable intelligence. To achieve that, you need low latency, data sovereignty, and high-performance storage that won't choke on writes.

Don't let slow I/O kill your visibility. Deploy a high-performance monitoring stack on a CoolVDS NVMe instance today and see what's actually happening inside your infrastructure.