Monitoring is Broken: Why Green Nagios Checks Don't Mean Happy Users

It’s 3:42 AM. My phone buzzes off the nightstand. I groggily check the SMS: "Service Critical: Load Average > 10." By the time I SSH into the jump host, the alert clears. Five minutes later, it’s back. This flapping nightmare is the reality for most sysadmins relying on legacy monitoring in 2014. We trust our "OK" status lights too much, conflating availability with performance. Just because port 80 accepts a connection doesn't mean your application isn't bleeding out milliseconds on every database query.

After a decade of managing high-traffic clusters across Europe, I’ve learned a hard truth: Availability is binary, but performance is analog. Your users don't care if the server is technically "up" if the Time to First Byte (TTFB) is 4 seconds. In this post, we are going to tear down the old "Ping and Pray" methodology and build a metrics pipeline that actually tells you what is happening inside your stack, using Graphite, Grafana, and the ELK stack.

The "Steal Time" Ghost

Before we touch the software, we need to address the elephant in the data center: your underlying hardware. I once spent three days debugging a random latency spike on a Magento checkout process. The code hadn't changed. The MySQL config was untouched.

The culprit? %st (Steal Time).

We were hosting on a budget VPS provider that oversold their CPU cores. A neighbor on the same physical hypervisor decided to mine Bitcoin (or compile the Linux kernel, who knows), and the hypervisor stole cycles from our VM to feed them. You cannot tune nginx.conf to fix a noisy neighbor.

Pro Tip: Always check steal time first when diagnosing intermittent slowness on a VPS. If it's consistently above 0.5%, move hosts.
# The 'top' command shows steal time under %st
top - 14:20:15 up 12 days,  4:15,  1 user,  load average: 0.15, 0.08, 0.05
Tasks:  85 total,   1 running,  84 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.5%us,  1.0%sy,  0.0%ni, 95.0%id,  0.2%wa,  0.0%hi,  0.1%si,  1.2%st
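
top is fine for a quick look, but if you want to script the check (or ship steal time into the metrics pipeline we build below), the raw counter in /proc/stat is easier to parse. A sketch only; the value is a cumulative jiffy counter, so diff two samples to turn it into a rate:

# The 9th field of the aggregate "cpu" line in /proc/stat is cumulative steal time (jiffies)
awk '/^cpu / {print $9}' /proc/stat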

This is why, for mission-critical production environments, I strictly recommend providers like CoolVDS. They use KVM virtualization, which provides stricter isolation than OpenVZ containers, and they don't oversell their cores. When you pay for a core, you get the cycles. Plus, with the recent shift to SSD-backed storage in their Oslo facility, I/O wait times (%wa) have effectively vanished.

Moving Beyond Nagios: The Metrics Pipeline

Nagios is great for telling you if a service is dead. It is terrible at telling you it is dying. To see the trend before the crash, we need time-series data. In 2014, the gold standard for this is Graphite paired with the new frontend, Grafana.

1. The Collector

You don't need heavy agents. A simple Bash script run from cron, or a small Python daemon, sending plaintext metrics to the Graphite Carbon port (2003) works wonders. Here is a dirty but effective script to ship load average and free RAM to Graphite every minute:

#!/bin/bash
SERVER_NAME="web01_oslo"
GRAPHITE_HOST="10.0.0.5"
PORT=2003
TIMESTAMP=$(date +%s)

# Get Load Avg (1 min)
LOAD=$(awk '{print $1}' /proc/loadavg)
echo "servers.$SERVER_NAME.load.1min $LOAD $TIMESTAMP" | nc -w 1 $GRAPHITE_HOST $PORT

# Get Free Memory in MB
FREE_MEM=$(free -m | awk '/^Mem:/ {print $4}')
echo "servers.$SERVER_NAME.memory.free $FREE_MEM $TIMESTAMP" | nc -w 1 $GRAPHITE_HOST $PORT

Add this to your crontab. Suddenly, you aren't just seeing "Load is OK." You are seeing "Load has increased by 15% every day at 2 PM." That is actionable intelligence.
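
For completeness, the cron entry itself is a one-liner. Assuming you save the script as /usr/local/bin/push_metrics.sh (the path and name here are just placeholders):

# Ship metrics to Carbon once a minute; silence output so cron doesn't mail you
* * * * * /usr/local/bin/push_metrics.sh >/dev/null 2>&1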

2. The Visualization

Graphite's built-in web interface looks like it was designed in 1995. This year, I've switched all my dashboards to Grafana. It talks to Graphite but renders beautiful, responsive HTML5 charts. You can overlay deploy events on top of CPU usage graphs to correlate code pushes with performance regressions.
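
Those deploy markers have to come from somewhere. A minimal sketch, assuming your graphite-web install has the events API enabled (the host, path and release string below are examples, not gospel): fire a curl at the end of your deploy script and let Grafana overlay the event on your graphs.

# Record a deploy event in graphite-web (note: this hits the web UI port, not Carbon's 2003)
curl -s -X POST "http://10.0.0.5/events/" \
  -d '{"what": "deploy", "tags": "web01_oslo deploy", "data": "release 1.2.3"}'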

Log Aggregation: Grep is Not a Strategy

If you are managing more than three servers, logging into each one to tail -f /var/log/nginx/error.log is madness. We need to aggregate. The ELK stack (Elasticsearch, Logstash, Kibana) has matured significantly with version 1.4.

However, Logstash is heavy on the JVM. For the shipping agent on the client nodes, I prefer Logstash-Forwarder (formerly Lumberjack). It is written in Go and leaves a tiny footprint.
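
The forwarder's config is one small JSON file. A sketch, with the Logstash hostname, port, and certificate path as assumptions you will need to adapt; the log path points at the JSON access log we define in the Nginx snippet below:

{
  "network": {
    "servers": [ "logstash01.example.com:5043" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 15
  },
  "files": [
    {
      "paths": [ "/var/log/nginx/access_json.log" ],
      "fields": { "type": "nginx_access" }
    }
  ]
}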

Here is a critical Nginx configuration tweak to make your logs JSON-friendly, which saves Logstash from having to do complex Regex parsing:

http {
    log_format json_combined
      '{ "time_local": "$time_local", '
      '"remote_addr": "$remote_addr", '
      '"remote_user": "$remote_user", '
      '"request": "$request", '
      '"status": "$status", '
      '"body_bytes_sent": "$body_bytes_sent", '
      '"request_time": "$request_time", '
      '"upstream_response_time": "$upstream_response_time", '
      '"http_referrer": "$http_referrer", '
      '"http_user_agent": "$http_user_agent" }';

    access_log /var/log/nginx/access_json.log json_combined;
}

By logging $request_time and $upstream_response_time, you can visualize exactly how long PHP-FPM or your backend app is taking to generate a response, separate from network latency.
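
On the Logstash side, the JSON format keeps the filter stage almost trivial. A sketch for Logstash 1.4; the port and certificate paths must match whatever you gave logstash-forwarder and are assumptions here:

input {
  lumberjack {
    port            => 5043
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key         => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}

filter {
  # The whole Nginx line is already JSON, so just parse it out of the message field
  json { source => "message" }
  # Store the timings as numbers, not strings, so you can aggregate on them in Kibana
  mutate { convert => [ "request_time", "float" ] }
  mutate { convert => [ "upstream_response_time", "float" ] }
}

output {
  # Adjust to wherever your Elasticsearch node lives
  elasticsearch { host => "localhost" }
}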

The Compliance Angle: Data Sovereignty in Norway

We operate in a post-Snowden world. Trust in US-based cloud giants is eroding. Many of my clients here in Oslo are asking tough questions about where their data physically resides. The Norwegian Data Inspectorate (Datatilsynet) is becoming stricter about how personal data (Personopplysninger) is handled under the Personal Data Act.

Latency is physics, but data sovereignty is law. Hosting your monitoring stack and production data within Norway isn't just about faster pings to NIX (Norwegian Internet Exchange); it's about legal safety. CoolVDS offers that jurisdiction certainty. Your bytes stay in Oslo. They aren't replicated to a data center in Virginia without your knowledge.

Optimizing the Network Stack

Finally, standard Linux distros like CentOS 6 or Ubuntu 14.04 come with very conservative networking defaults. If you are pushing thousands of metrics per minute, you might hit limits. Tune your /etc/sysctl.conf to handle the traffic:

# Increase the range of ephemeral ports
net.ipv4.ip_local_port_range = 1024 65535

# Allow reusing sockets in TIME_WAIT state for new connections
net.ipv4.tcp_tw_reuse = 1

# Increase the maximum number of open files
fs.file-max = 2097152

# Protect against SYN flood attacks
net.ipv4.tcp_syncookies = 1

Apply these with sysctl -p. These settings ensure your monitoring agents don't choke when the network gets busy.
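
It is worth confirming the kernel actually took the new values; remember that fs.file-max is only the system-wide ceiling, while per-process limits are still governed by ulimit and limits.conf.

# Spot-check the live values after sysctl -p
sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse fs.file-max net.ipv4.tcp_syncookies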

Summary: Don't Fly Blind

The era of manually checking disk space is over. To survive in 2015, you need granular visibility. You need to know if your database I/O is saturating before the site locks up. You need to know if your hosting provider's hypervisor is stealing your CPU cycles.

Build your Graphite/ELK stack. Visualize the data. And run it on infrastructure that respects your need for raw, dedicated performance.

Ready to stop fighting noisy neighbors? Spin up a pure-SSD KVM instance on CoolVDS in Oslo. The network latency is low, the I/O is high, and your data stays in Norway.