Stop Letting Silent Failures Kill Your Holidays
It is December 14th. If you work in e-commerce or run high-traffic infrastructure, you are likely surviving on caffeine and sheer anxiety. The traffic is peaking, and the last thing you need is a pager alert at 3:00 AM because a disk filled up or a database locked up.
I have been there. In 2015, I watched a Magento cluster implode during Black Friday. Why? Because our monitoring system—a bloated Nagios setup—was checking if the server was alive, but not if it was healthy. The server responded to Ping, but the I/O wait was hitting 90% because of a noisy neighbor on a cheap shared VPS host. The site wasn't down; it was just taking 45 seconds to load. To the customer, that is the same thing.
Today, we are fixing this. We are moving away from binary "up/down" checks and towards granular, time-series metrics using Prometheus and Grafana. This is the stack that defines 2018 infrastructure.
The Problem with "Is It Up?"
Traditional monitoring tools poll your server every 5 minutes. In the world of microservices and containers, a 5-minute interval is an eternity. A worker process can spike, crash, and be restarted by its supervisor within 30 seconds. Your 5-minute poll will miss it entirely, leaving you wondering why your error logs are full of 502 Bad Gateway errors.
You need metrics scraping at 10-15 second intervals. You need to know the rate of change, not just the static value.
The 2018 Stack: Prometheus + Node Exporter
We are going to deploy this on Ubuntu 18.04 LTS. If you are still running 14.04, stop reading and go upgrade your OS. Seriously.
1. The Exporter Strategy
Prometheus doesn't use agents in the traditional sense. It pulls (scrapes) data from endpoints. The first thing we need on your target servers—whether they are web nodes, database masters, or load balancers—is the node_exporter. It exposes kernel-level metrics over HTTP.
Here is how to set it up as a systemd service so it survives reboots:
# Create a user specifically for the exporter
useradd --no-create-home --shell /bin/false node_exporter
# Download the binary (Version 0.17.0 is current stable as of late 2018)
wget https://github.com/prometheus/node_exporter/releases/download/v0.17.0/node_exporter-0.17.0.linux-amd64.tar.gz
tar xvf node_exporter-0.17.0.linux-amd64.tar.gz
# Install the binary and hand ownership to the dedicated user
cp node_exporter-0.17.0.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter
Next, create the service file at /etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
Enable and start it. You should now see metrics flowing at http://YOUR_SERVER_IP:9100/metrics.
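For completeness, here is a minimal sketch of those steps (run as root, using the unit file path above):
# Pick up the new unit, enable it at boot, and start it
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
# Quick sanity check from the node itself
curl -s http://localhost:9100/metrics | head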
2. Configuring Prometheus
On your monitoring server (I recommend a dedicated instance for this), install Prometheus v2.5. The configuration is YAML-based. This is where the magic happens.
Edit /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s  # Scrape every 15 seconds. High resolution.
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.0.5:9100', '10.0.0.6:9100']  # Your application servers
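Before restarting Prometheus to pick up the change, validate the file with promtool, which ships in the Prometheus tarball. A quick sketch, assuming you run Prometheus as a systemd service named prometheus:
# Catch YAML and config mistakes before they take the scraper down
promtool check config /etc/prometheus/prometheus.yml
# Apply the change
systemctl restart prometheus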
Pro Tip: Do not expose your metrics port (9100) to the public internet. Leaving it open and hoping nobody notices is security through obscurity, which is not security. Use a VPN or VPC peering instead. If you are hosting with CoolVDS, use the private network interface for metric scraping to avoid bandwidth charges and keep data secure within the datacenter.
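One way to enforce that on Ubuntu 18.04 is a host firewall rule. A sketch using ufw, where 10.0.0.2 is a stand-in for your monitoring server's private address:
# Allow scrapes only from the monitoring server, drop everything else hitting 9100
ufw allow from 10.0.0.2 to any port 9100 proto tcp
ufw deny 9100/tcp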
Visualizing the Truth with Grafana
Prometheus collects the data, but Grafana makes it readable. Grafana 5.4 was released recently and it brings significant improvements to dashboard provisioning.
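If you prefer configuration over clicking, Grafana 5.x can provision the Prometheus data source from a file. A minimal sketch, assuming a default package install and Prometheus listening locally on its standard port:
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true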
Once you connect Prometheus as a data source, you need to look at the metrics that actually matter. Do not just stare at CPU usage. CPU usage is misleading.
The Metric That Matters: iowait
In a virtualized environment, Steal Time and I/O Wait are your enemies. They indicate that the host node is overcommitted.
Query this in Grafana:
rate(node_cpu_seconds_total{mode="iowait"}[5m])
If this graph spikes, your disk cannot keep up with your database writes. This is common in shared hosting environments where "SSD" often means "Consumer SATA SSD shared by 500 users."
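Steal time deserves its own panel next to iowait. A sketch of the equivalent query, averaged across cores so each host renders as a single line:
avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))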
This is where infrastructure choice becomes critical. You can tune MySQL's innodb_io_capacity all day, but if the underlying storage is choking, you will see latency. At CoolVDS, we use enterprise-grade NVMe storage and KVM virtualization. KVM ensures stricter isolation than container-based virtualization (like OpenVZ), meaning your neighbors can't steal your I/O operations.
Data Sovereignty and GDPR
Since May 25th of this year (2018), GDPR has changed how we handle logs. IP addresses in access logs are considered PII (Personally Identifiable Information).
When you are aggregating logs (perhaps using the ELK stack alongside Prometheus), you must ensure this data remains within legal boundaries. For Norwegian companies, hosting your monitoring stack and data inside Norway (or the EEA) simplifies compliance with Datatilsynet requirements.
| Feature | Public Cloud (US Regions) | CoolVDS (Oslo) |
|---|---|---|
| Latency to NIX | 20-40ms | < 5ms |
| Data Sovereignty | Privacy Shield (Complex) | Norwegian Law (Simple) |
| Neighbor Noise | High Variance | Strict KVM Isolation |
Alerting: Don't Wake Me Up Unless It's Fire
The final piece is Alertmanager. Do not alert on "High CPU." Alert on "High Latency" or "Error Rate." Users don't care if your CPU is at 90%; they care if the checkout page loads.
Here is a rule for high error rates (over 5% of requests failing):
groups:
  - name: web_alerts
    rules:
      - alert: HighErrorRate
        expr: >
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High failure rate on {{ $labels.instance }}"
Conclusion
Visibility is the only difference between a professional system administrator and a firefighter. By implementing Prometheus and Grafana, you gain the ability to see trends over weeks, not just seconds. You can correlate a deployment at 2:00 PM with a memory leak that starts at 4:00 PM.
However, software monitoring cannot fix hardware limitations. If your monitoring shows constant I/O wait or CPU steal time, your current provider is overselling their capacity.
Don't let slow I/O kill your SEO rankings or your holiday sales. Deploy a KVM-based, NVMe-powered instance on CoolVDS today. Spin up a test server in 55 seconds and see what 0% steal time looks like.