Silence the Noise: Effective Infrastructure Monitoring at Scale (2019 Edition)

The 3 AM PagerDuty Wake-Up Call

It’s 03:14. Your phone is buzzing off the nightstand. The monitoring system screams that CPU usage on app-server-04 hit 90%. You groggily SSH in, run htop, and see... nothing. The load has already dropped. The site is up. You just lost two hours of sleep for a ghost. If you manage infrastructure, you know this pain. It is the result of "vanity metrics" and poor alerting thresholds.

In 2019, simply installing Nagios and checking if port 80 is responding is professional negligence. With microservices and containerization becoming the standard—especially with the rise of Kubernetes 1.14—infrastructure is ephemeral. We need monitoring that understands trends, saturation, and latency, not just simple up/down states. This guide breaks down how to build a monitoring stack that actually works, focusing on the specific challenges we face here in the Nordics, from GDPR compliance to latency across the NIX (Norwegian Internet Exchange).

The "Steal Time" Trap in Virtualized Environments

Before we touch software, we have to talk about the platform. The number one reason for unexplained application slowness on a VPS is CPU Steal Time (%st). This metric is the percentage of time your virtual CPU was ready to run but had to wait for a physical core because the hypervisor was busy servicing another guest.

If you are hosting on budget providers who oversell their cores, your monitoring will show 50% CPU usage, but your app will be unresponsive. Why? Because your neighbor is mining crypto or compiling a kernel. You can check this instantly on any Linux box:

top - 15:42:12 up 10 days,  2:11,  1 user,  load average: 0.08, 0.03, 0.01
%Cpu(s):  2.0 us,  0.3 sy,  0.0 ni, 97.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.2 st

See that 0.2 st at the end? That is acceptable. If that number hits 10% or higher, move your workload. At CoolVDS, we strictly limit oversubscription and use KVM (Kernel-based Virtual Machine) to ensure resource isolation. We don't play the "noisy neighbor" game because we know that consistent latency is better than raw burst speed that disappears when you need it.
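
Watching top by hand does not scale, though. Once node_exporter and Prometheus (both covered below) are in place, steal time can be tracked continuously. A minimal alerting rule sketch, assuming the standard node_cpu_seconds_total metric with its mode label:

groups:
- name: steal_time
  rules:
  - alert: HighCpuStealTime
    # Average fraction of CPU time stolen by the hypervisor over the last 5 minutes
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) > 0.10
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "CPU steal time above 10% on {{ $labels.instance }}"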

The Stack: Prometheus & Grafana

Forget the proprietary SaaS tools that charge by the data point and send your user data across the Atlantic (a massive GDPR headache for Norwegian companies). The industry standard in 2019 is open-source: Prometheus for metrics collection and Grafana 6.0 for visualization.

Prometheus operates on a "pull" model. Your servers expose metrics via HTTP, and Prometheus scrapes them. This is superior to push models for infrastructure work: when a server dies, the scrape fails, and Prometheus flags the target as down on the very next cycle.
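
In practice, a failed scrape shows up as the built-in up metric dropping to 0 for that target, so dead hosts can be caught with a one-line expression. A minimal rule sketch, ready to drop into any rule group:

- alert: InstanceDown
  # Prometheus sets the built-in "up" metric to 0 whenever a scrape fails
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} has been unreachable for more than 1 minute"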

1. Deploying the Exporters

First, don't just monitor the OS. Monitor the hardware and the services. We use node_exporter for system metrics. Here is how to set it up as a systemd service on Ubuntu 18.04 LTS:

useradd -rs /bin/false node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.0/node_exporter-0.18.0.linux-amd64.tar.gz
tar -xvf node_exporter-0.18.0.linux-amd64.tar.gz
mv node_exporter-0.18.0.linux-amd64/node_exporter /usr/local/bin/

Next, create the service file at /etc/systemd/system/node_exporter.service:

[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Reload daemon and start:

systemctl daemon-reload
systemctl start node_exporter
systemctl enable node_exporter
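
Before moving on, confirm the exporter is actually serving metrics on its default port:

# Should print CPU time counters broken down by mode (idle, iowait, steal, ...)
curl -s http://localhost:9100/metrics | grep node_cpu_seconds_total | head -n 5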

2. Configuring Prometheus

Once your nodes are exposing data on port 9100, configure prometheus.yml. We use service discovery where possible, but for a static set of VPS instances, a static config works fine.

Pro Tip: If you are monitoring servers in different locations (e.g., Oslo and Frankfurt), keep the scrape interval reasonable. A 15s scrape interval over the public internet can lead to gaps if network jitter occurs. For cross-region monitoring, we recommend federated Prometheus servers; a sketch follows the config below.

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'coolvds_nodes_oslo'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
        labels:
          region: 'norway'
          env: 'production'

  - job_name: 'mysql_metrics'
    static_configs:
      - targets: ['10.0.0.5:9104']
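
If you take the federated route mentioned in the tip above, the central Prometheus scrapes each regional server's /federate endpoint instead of reaching every node over the public internet. A sketch of what that job can look like, assuming a regional Prometheus reachable at prometheus-osl.internal:9090 (the hostname is illustrative):

  - job_name: 'federate_oslo'
    scrape_interval: 60s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"coolvds_nodes_.*"}'
    static_configs:
      - targets: ['prometheus-osl.internal:9090']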

Alerting on What Matters (The Golden Signals)

Google's SRE book (published 2016) taught us to monitor the four Golden Signals: Latency, Traffic, Errors, and Saturation. CPU usage is a proxy, not a signal. A server at 95% CPU might be perfectly fine if the run queue is empty. A server at 20% CPU might be broken if it's waiting on disk I/O.
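
For latency and errors at the application layer, PromQL queries along these lines are a sensible starting point. This is a sketch that assumes your service exports the conventional http_request_duration_seconds histogram and an http_requests_total counter with a status label; metric names vary by client library and framework:

# Latency: 95th percentile request duration over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Errors: fraction of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))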

We use Alertmanager to handle notifications. Here is a rule that detects when an instance is under heavy I/O pressure, which is far more indicative of a user-facing slowdown than raw CPU usage. It fires when the average disk read latency stays above 100ms for two minutes.

groups:
- name: host_alerts
  rules:
  - alert: HighDiskLatency
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High disk latency on {{ $labels.instance }}"
      description: "Disk latency is above 100ms (current value: {{ $value }}s)"
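
Routing in Alertmanager is what keeps warnings in chat and reserves the pager for genuine incidents. A minimal alertmanager.yml sketch, with placeholder keys and URLs you would replace with your own:

route:
  receiver: 'slack-warnings'
  group_by: ['alertname', 'instance']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'

receivers:
  - name: 'slack-warnings'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'
        channel: '#ops-alerts'
  - name: 'pagerduty-oncall'
    pagerduty_configs:
      - service_key: 'REPLACE_WITH_PAGERDUTY_INTEGRATION_KEY'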

The I/O Bottleneck: NVMe vs. SATA

In 2019, we still see providers offering "SSD VPS" that are actually backed by SATA SSDs in a RAID array that is getting hammered by 500 other tenants. When your database tries to flush the InnoDB buffer pool, your latency spikes.

For high-performance databases (PostgreSQL, MySQL, MongoDB), I/O wait (iowait) is the enemy. This is where hardware selection becomes your primary monitoring tool. We built CoolVDS on pure NVMe storage because NVMe offers far deeper command queues than SATA: tens of thousands of commands spread across parallel queues, versus a single queue of 32 commands for AHCI. You can test your current provider's disk latency with fio. If the results show random read IOPS below 10k, you are going to hit bottlenecks.

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test --filename=test --bs=4k --iodepth=64 --size=4G \
    --readwrite=randread --ramp_time=4
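
While the benchmark runs, watch per-device latency live with iostat from the sysstat package; the r_await and w_await columns show the average milliseconds each read and write spends waiting:

# Extended device statistics, refreshed every second
iostat -x 1

The same signal is available continuously through node_exporter: the expression avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) tracks the fraction of CPU time stuck waiting on I/O across your fleet.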

Data Residency and Latency in Norway

For our Norwegian clients, latency isn't just a technical metric; it's a user experience requirement. Hosting in Frankfurt or London usually adds 20-30ms of round-trip time compared to hosting directly in Oslo. When your application requires multiple round-trips to render a page (common with modern JavaScript frameworks), that 30ms compounds into visible delay.
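
You can measure that round-trip continuously instead of guessing. The blackbox_exporter probes endpoints over HTTP and exposes probe duration as a metric Prometheus can scrape; a minimal job sketch, assuming the exporter runs next to Prometheus on port 9115 and that https://www.example.no/ stands in for your own site:

  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://www.example.no/']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '127.0.0.1:9115'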

Furthermore, while GDPR applies in Norway through the EEA agreement, the national regulator, Datatilsynet, is notoriously strict in its enforcement. Keeping your monitoring data, which often contains IP addresses and user identifiers, within Norwegian borders simplifies compliance significantly. CoolVDS infrastructure is physically located in Oslo, ensuring your logs and metrics stay under local jurisdiction.

Conclusion

Monitoring is not about pretty graphs; it is about knowing exactly when and why your system is degrading. By switching from check-based monitoring (Nagios) to trend-based monitoring (Prometheus), and by choosing infrastructure that minimizes "steal time" and I/O wait, you build resilience.

Don't let slow I/O or noisy neighbors kill your uptime metrics. If you need a consistent, low-latency baseline to build your monitoring stack on, deploy a CoolVDS NVMe instance today. We provide the raw performance; you provide the code.