Beyond Nagios: Why "Green" Status Checks Hide Your Latency Spikes
It is 3:00 AM in Oslo. Your phone buzzes. It’s not a Nagios alert—Nagios says everything is fine. It’s the CEO. Customers are complaining that the checkout page on your Magento store is taking 45 seconds to load. You check your dashboard: CPU is at 10%, RAM is free, and the disk isn't full. According to your monitoring tools, the server is "healthy."
This is the failure of binary monitoring. In the world of high-availability hosting, knowing a service is "UP" is irrelevant if it's too slow to use.
As systems administrators, we have spent the last decade obsessed with uptime. But in 2013, uptime is the bare minimum requirement, not a metric of success. The real battleground is performance visibility—knowing not just if the server is running, but how it is running. This guide will move you from basic status checks to granular metric collection using tools like Graphite and StatsD, specifically tailored for the high-throughput environments we build at CoolVDS.
The "Nagios Green" Fallacy
Legacy monitoring tools like Nagios, Zabbix, or Cacti are excellent for alerting you when a process dies. They ask binary questions: "Is port 80 open?" or "Is load average below 5.0?"
However, they fail to answer the nuanced questions that actually impact your bottom line:
- Why did the MySQL query latency jump from 50ms to 500ms at 14:00?
- Is the disk I/O wait time caused by backups or a DDoS attack?
- Why is Nginx queuing connections despite low CPU usage?
Pro Tip: Never rely solely on external pings. A server can respond to an ICMP ping in 1ms while the Apache process is deadlocked and timing out every HTTP request. You need internal metrics.
The Solution: Metric Aggregation (Graphite & StatsD)
To solve this, we need to separate alerting (Nagios) from trending (Graphite). We push metrics to a collector that visualizes them over time, which lets us correlate system events: for example, seeing that a spike in iostat.await lines up exactly with a specific cron job.
1. The Architecture
We will use a standard stack gaining massive traction this year:
- Collector: StatsD (running on the local node)
- Storage/Graphing: Graphite (Carbon & Whisper)
- Visualization: Graphite Web (or verify trends via command line)
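Before wiring up real metrics, it is worth sanity-checking the pipeline end to end. Here is a minimal sketch, assuming Carbon is running on the same node and listening on its default plaintext port (2003); the metric name is just a placeholder:

#!/usr/bin/env python
# Sanity check: push one test value straight to Carbon's plaintext
# listener and confirm it shows up in the Graphite tree shortly after.
import socket
import time

CARBON_HOST = 'localhost'   # assumes Carbon runs on this node
CARBON_PORT = 2003          # Carbon's default plaintext receiver port

sock = socket.socket()
sock.connect((CARBON_HOST, CARBON_PORT))
# Plaintext protocol: "metric.path value unix_timestamp\n"
sock.sendall('test.coolvds.heartbeat 1 %d\n' % int(time.time()))
sock.close()

If test.coolvds.heartbeat appears in the Graphite tree, Carbon and Whisper are doing their job, and anything StatsD flushes will land in the same place.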
2. Configuring Nginx for Metrics
First, enable the stub_status module in Nginx. This gives us raw data to scrape. On your CoolVDS instance (CentOS 6 or Ubuntu 12.04), edit your virtual host config:
server {
    listen 127.0.0.1:80;
    server_name localhost;

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}
Reload Nginx with service nginx reload. Now a quick curl http://127.0.0.1/nginx_status returns something like:
Active connections: 291
server accepts handled requests
16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
3. Feeding the Data to Graphite
We don't want to just read this; we want to graph it. Here is a simple Python script using the statsd client library (pip install statsd) to push these metrics every 10 seconds. Save this as /opt/scripts/nginx_metrics.py:
#!/usr/bin/env python
import urllib2
import statsd
import time
import re

# Configure your Graphite/StatsD server
c = statsd.StatsClient('localhost', 8125)

while True:
    try:
        response = urllib2.urlopen('http://127.0.0.1/nginx_status')
        content = response.read()

        # Regex to parse the status output
        active = re.search(r'Active connections:\s+(\d+)', content)
        if active:
            c.gauge('nginx.active_connections', int(active.group(1)))

        # Parse Reading/Writing/Waiting
        rww = re.search(r'Reading:\s+(\d+).*Writing:\s+(\d+).*Waiting:\s+(\d+)', content)
        if rww:
            c.gauge('nginx.reading', int(rww.group(1)))
            c.gauge('nginx.writing', int(rww.group(2)))
            c.gauge('nginx.waiting', int(rww.group(3)))
    except Exception as e:
        print "Error retrieving stats: ", e

    time.sleep(10)
Running this script lets you see connection spikes in real time. If the "Writing" gauge climbs while your traffic has not actually increased, nginx is stuck waiting for the backend (PHP/MySQL) to produce responses. "Waiting" counts idle keep-alive connections; when it spikes, those idle connections eat into your worker_connections budget.
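The gauges above capture point-in-time state; the cumulative accepts/handled/requests counters from stub_status can be turned into a rate as well. Here is a rough sketch of the idea; the metric name and the 10-second interval are arbitrary choices, not part of the script above:

#!/usr/bin/env python
# Rough sketch: derive requests-per-second from the cumulative "requests"
# counter by diffing two samples. Metric name and interval are illustrative.
import time
import urllib2

import statsd

c = statsd.StatsClient('localhost', 8125)

def read_requests():
    # Third line of the stub_status output holds: accepts handled requests
    content = urllib2.urlopen('http://127.0.0.1/nginx_status').read()
    return int(content.splitlines()[2].split()[2])

previous = read_requests()
while True:
    time.sleep(10)
    current = read_requests()
    delta = max(current - previous, 0)  # guard against counter resets on restart
    c.gauge('nginx.requests_per_sec', delta / 10.0)
    previous = current

A flat requests-per-second line combined with a climbing "Writing" gauge is the classic signature of a backend that has started to drag.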
The Hardware Reality: Why Virtualization Matters
You can have the best monitoring in the world, but if your underlying infrastructure suffers from "Steal Time" (CPU Steal), your metrics will lie to you. In 2013, many VPS providers in Europe are still heavily oversubscribing OpenVZ containers. This leads to "noisy neighbor" issues where another customer's database backup kills your Magento store's latency.
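You can measure steal time yourself rather than take a provider's word for it. A minimal sketch, assuming a Linux guest where /proc/stat exposes the standard cpu counters (the eighth value after the label is steal) and reusing the same StatsD client as above; the metric name is an example:

#!/usr/bin/env python
# Rough sketch: sample /proc/stat twice and report the share of CPU time
# stolen by the hypervisor over the interval. Metric name is illustrative.
import time

import statsd

c = statsd.StatsClient('localhost', 8125)

def cpu_times():
    with open('/proc/stat') as f:
        # First line: "cpu user nice system idle iowait irq softirq steal ..."
        return [int(v) for v in f.readline().split()[1:]]

before = cpu_times()
time.sleep(10)
after = cpu_times()

deltas = [a - b for a, b in zip(after, before)]
total = sum(deltas)
steal = deltas[7] if len(deltas) > 7 else 0  # 8th column is steal
if total:
    c.gauge('system.cpu.steal_pct', 100.0 * steal / total)

If this figure regularly climbs above a few percent, your graphs are telling you about your neighbors' workloads, not your own.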
This is where architecture choice is critical.
| Feature | Standard VPS (OpenVZ) | CoolVDS (KVM) |
|---|---|---|
| Kernel Access | Shared Host Kernel | Dedicated Kernel |
| Swap Memory | Fake/Burst | Dedicated Partition |
| IO Scheduling | Host Controlled (CFQ) | User Controlled (Deadline/Noop) |
| Isolation | Software Containers | Hardware Virtualization |
At CoolVDS, we exclusively use KVM (Kernel-based Virtual Machine) on top of SSD RAID-10 arrays. This allows you to define your own I/O scheduler inside the VM. For high-performance database workloads, switching from the default cfq to deadline or noop can reduce latency by 20%.
Check your current scheduler with:
cat /sys/block/vda/queue/scheduler
[cfq] deadline noop
Change it instantly without rebooting:
echo deadline > /sys/block/vda/queue/scheduler
To make the change survive a reboot, add elevator=deadline to the kernel line in your bootloader configuration.
Data Privacy and The Norwegian Context
Latency isn't the only performance metric; compliance is a metric of risk. With the increasing scrutiny from Datatilsynet and the complexity of the Personal Data Act (Personopplysningsloven), knowing exactly where your logs are stored is paramount. By hosting on CoolVDS, your data resides physically in Oslo, connected directly to NIX (Norwegian Internet Exchange).
This offers two distinct advantages:
- Legal: Your data never crosses borders, simplifying compliance with Norwegian privacy laws.
- Technical: Ping times to major Norwegian ISPs (Telenor, Altibox) are often sub-5ms.
War Story: The "Ghost" Bottleneck
Last month, we helped a client migrate from a generic German host to our Oslo facility. Their MySQL server was crashing randomly. Their old host's support blamed "high traffic," but their graphs showed low CPU usage.
We installed `sysstat` and ran extended I/O statistics:
iostat -x 1 10
The output revealed the truth:
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
vda 0.00 0.00 2.00 85.00 16.00 9500.00 109.38 25.40 250.50 9.00 98.50
Look at %util: 98.50%. The disk was thrashing. The await (average wait time for I/O) was 250ms. Their previous host was using spinning HDD storage shared among hundreds of users. The CPU was idle because it was waiting for the disk.
We moved them to a CoolVDS SSD instance. %util dropped to 5%, and page load times went from 4 seconds to 400ms. Same CPU, same RAM, different storage technology.
Final Thoughts
Don't wait for a user to report a slow site. By the time they complain, they have already gone to a competitor. Move beyond binary "up/down" monitoring. Implement Graphite, track your I/O wait times, and ensure your underlying infrastructure isn't stealing your performance.
Ready to stop guessing? Deploy a KVM instance on CoolVDS today. With our SSD-backed infrastructure and direct peering at NIX, you get the raw headroom you need to handle traffic spikes without the noisy neighbors.