Silence the Pager: A Battle-Tested Guide to Infrastructure Monitoring at Scale (2018 Edition)

It is December 24th. While normal people are wrapping gifts or prepping ribbe, you are staring at a terminal, wondering why the load average on your primary database node just spiked to 50.0. Silence is golden, until it's the silence of a crashed server that failed to send an alert before it died.

I have been there. I have seen production environments melt down because we relied on "gut feeling" and manual top checks rather than granular, historical data. In the wake of the GDPR enforcement earlier this year (May 2018), the stakes are higher. You cannot just dump logs into a US-based cloud bucket and hope for the best anymore. You need visibility, you need it hosted locally, and you need it now.

This guide isn't about installing a plugin. It is about architecting a monitoring stack that actually lets you sleep.

The Myth of "Uptime"

Most VPS providers tout "99.9% network uptime." That is a vanity metric. It means their switch is on. It tells you nothing about whether your MySQL process is deadlocked or if your disk I/O is saturated. Real reliability comes from white-box monitoring—knowing the internals of your systems.

In 2018, the industry standard for this is shifting rapidly from Nagios to Prometheus. Unlike the push-based models of the past, Prometheus pulls metrics over HTTP. That distinction matters at scale: with push, an overloaded collector can back up into your production app; with pull, your app just exposes an endpoint and it is the monitoring server's problem to come and fetch it.

Step 1: The Eyes on the Ground (Node Exporter)

You need an agent to expose kernel-level metrics. We use the node_exporter. It is lightweight, written in Go, and gives us the raw truth about the hardware.

Do not just run the binary by hand. Create a dedicated service user and a proper systemd unit. If you are running this on a CoolVDS instance (which uses pure KVM), you have full control over these service definitions.

# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
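
Bootstrapping that user and unit looks roughly like this on any systemd distro; the binary path matches the unit above, and the useradd flags are just the usual way to create a locked-down system account:

# Create a system user with no home directory and no login shell
sudo useradd --system --no-create-home --shell /bin/false node_exporter

# Pick up the new unit file, then start the exporter now and at every boot
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter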

Once enabled, you can curl the metrics locally to verify:

curl localhost:9100/metrics | grep node_load1

You should see a raw float value. If you see this, you are winning.
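
The exact number will differ on your box, but the shape of the output is what matters: a HELP line, a TYPE line, and the metric itself (the 0.42 below is purely illustrative):

# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42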

Step 2: The Silent Killer (CPU Steal Time)

Here is where most generic cloud hosting fails. You buy 4 vCPUs, but you are sharing the physical core with twenty other noisy neighbors. When they spike, you slow down.

In your monitoring, you must watch node_cpu_seconds_total{mode="steal"}. If this metric rises, your code is fine, but your host is choking.
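
The raw metric is a per-CPU counter, so turn it into a percentage before you stare at it. A query along these lines works both as a dashboard panel and as an alert expression (the 5-minute window is a sensible default, not gospel):

# Percentage of CPU time stolen by the hypervisor, averaged per instance
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100

Anything that sits above a few percent for sustained periods is worth a pointed conversation with your provider.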

Pro Tip: On CoolVDS NVMe instances, we strictly limit overselling. We monitor our hypervisors to ensure your %st (steal time) stays near zero. If you are seeing high steal time on your current provider, migrate. No amount of code optimization fixes a crowded physical host.

Step 3: Configuring Prometheus

Prometheus configuration is defined in YAML. It needs to know where to scrape. Here is a battle-hardened prometheus.yml configuration block that includes a basic scrape config for your infrastructure.

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'coolvds_nodes'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']
    relabel_configs:
      - source_labels: [__address__]
        regex: '10\.0\.0\.5:9100'
        target_label: instance
        replacement: 'db-master-oslo'

Notice the scrape interval. 15 seconds is granular enough for most web apps. Anything less creates massive data storage requirements; anything more misses micro-bursts.
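
Validate the file before you restart anything. promtool ships alongside Prometheus 2.x, and the server re-reads its configuration on SIGHUP, so a typical check-and-reload cycle looks like this (the config path is simply where I keep mine):

# Catch YAML and relabel mistakes before they hit the running server
promtool check config /etc/prometheus/prometheus.yml

# Prometheus reloads its configuration on SIGHUP
kill -HUP $(pidof prometheus)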

Step 4: Visualizing the Pain (Grafana)

Raw data is useless if you cannot read it. Grafana 5.x has brought significant improvements in dashboarding. Connect it to your Prometheus data source.

Do not build dashboards from scratch. Use the community standard ID 1860 (Node Exporter Full) as a base, then strip it down. You only care about three things (example queries below):

  • I/O Wait: Is the disk too slow? (Critical for databases).
  • Network Traffic: Are you under DDoS attack?
  • Saturation: Are you running out of file descriptors?
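
If you want to hand-roll those three panels, queries along these lines are a sane starting point (a sketch; metric names assume node_exporter 0.16 or newer, which renamed most collectors):

# I/O wait as a percentage of CPU time, per instance
avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100

# Inbound traffic in bytes per second, excluding loopback
rate(node_network_receive_bytes_total{device!="lo"}[5m])

# File descriptor saturation: allocated vs. kernel maximum
node_filefd_allocated / node_filefd_maximum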

The Norwegian Context: Latency and Law

Why does geography matter in 2018? Two reasons: Physics and Law.

1. The Speed of Light

If your users are in Norway, but your monitoring server is in Virginia (us-east-1), you are dealing with 90ms+ latency just for the packet round trip. When diagnosing a microservice architecture, that network jitter masks the real problems. Hosting your monitoring stack on CoolVDS in Oslo puts you milliseconds away from the NIX (Norwegian Internet Exchange).

2. GDPR & Datatilsynet

Since May 25th, storing IP addresses and user identifiers has become a legal minefield. If your logs contain PII (Personally Identifiable Information) and you ship them outside the EEA without proper safeguards, you have a compliance problem. By keeping your Prometheus and ELK stacks on Norwegian soil, you simplify your data processor agreements significantly.

Automating Alerts (So You Can Sleep)

The biggest mistake I see is alerting on CPU > 80%. A database batch job should use 100% CPU. That is what you paid for. Alert on symptoms, not causes.

Here is an alert.rules snippet for high latency, which is a symptom that actually impacts users:

groups:
  - name: latency_alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, instance)) > 0.5
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "High latency on {{ $labels.instance }}"

This rule triggers only if the 99th percentile latency on an instance exceeds 500ms for more than 2 minutes (grouping by le and instance keeps the instance label available for the annotation). This eliminates flapping alerts that wake you up for a 10-second hiccup.
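
The rule file does nothing until Prometheus loads it and knows where to deliver firing alerts. Here is a minimal sketch of the extra prometheus.yml blocks, assuming Alertmanager runs on its default port on the same host and the rules live where I keep mine:

rule_files:
  - /etc/prometheus/alert.rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Run promtool check rules against the file before reloading, and remember that turning the severity: page label into an actual page (email, Slack, PagerDuty) is Alertmanager routing, not Prometheus's job.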

Conclusion

Infrastructure monitoring isn't about pretty charts. It's about knowing exactly when and why your system is failing so you can fix it before the customer notices. In late 2018, the tools are mature. Prometheus and Grafana are robust, open-source standards.

But software is only as good as the hardware it runs on. You need low latency to Oslo, high I/O throughput for your time-series databases, and strict isolation to prevent CPU steal. That is exactly what we built CoolVDS to provide.

Don't wait for the next crash. Deploy a Prometheus instance on a CoolVDS NVMe server today and see what your infrastructure is actually doing.