
Surviving the Spike: Architecting Real-Time Infrastructure Monitoring in 2014


If your idea of monitoring is a cron job that pings your server every five minutes, you are already down. You just don't know it yet. In the last three months, I've audited a dozen infrastructure setups across Scandinavia, and the pattern is terrifyingly consistent: everyone has logs, but nobody has metrics.

We are operating in an era where "uptime" is a vanity metric. Your server can be up, responding to ICMP echo requests, while your MySQL connection pool is exhausted and your disk I/O wait is hitting 95%. To the ping check, you are green. To your users trying to checkout on their mobile devices, you are dead.

This post is about moving beyond binary status checks (Up/Down) to granular, high-resolution telemetry using the modern stack: Graphite, Grafana, and Collectd, while keeping Nagios for the critical alerts.

The War Story: When "Up" Means "Broken"

Last winter, I was called into a media house in Oslo. They were running a high-traffic news portal on a clustered LAMP stack. During the Sochi Olympics coverage, their site became unresponsive. Their dashboard? All green. Nagios reported HTTP 200 OK because the static front-page cache was being served by Varnish.

But the backend was melting. The writers couldn't save articles. The problem wasn't CPU; it was disk latency. Their shared storage solution was saturated by a backup job running on a neighbor's instance. If they had been graphing iostat metrics, they would have seen the write latency spike from 5ms to 500ms minutes before the collapse.

Pro Tip: Never trust shared spinning rust for high-IO workloads. This is why at CoolVDS, we enforce strict KVM isolation and offer SSD-cached storage or pure SSD options. We don't overcommit I/O. If you buy the IOPS, you get the IOPS.

Step 1: The Metrics Pipeline (Collectd + Graphite)

Forget parsing logs for performance data. It's too slow and heavy. In 2014, the standard for high-performance metrics is Collectd pushing to Graphite.

Collectd is a lightweight daemon written in C. It grabs system statistics every 10 seconds and ships them straight to Graphite (Carbon) over TCP or UDP. It has almost zero overhead.

Installing and Configuring Collectd on CentOS 6.5

First, enable the EPEL repository and install the daemon:

yum install epel-release
yum install collectd collectd-rrdtool

Now, let's configure /etc/collectd.conf. We want to track CPU, Interface traffic, Load, Memory, and most importantly, Disk operations.

Hostname "web-node-01.oslo.dc"
FQDNLookup true
Interval 10

LoadPlugin syslog
LoadPlugin cpu
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin disk
LoadPlugin write_graphite


<Plugin disk>
  Disk "vda"
  IgnoreSelected false
</Plugin>

<Plugin write_graphite>
  <Node "graphite">
    Host "10.10.0.55"
    Port "2003"
    Protocol "tcp"
    LogSendErrors true
    Prefix "coolvds.production."
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
  </Node>
</Plugin>

This configuration pushes metrics every 10 seconds to a private IP (always keep monitoring traffic on a private interface to avoid bandwidth costs and security risks). On CoolVDS, setting up a private VLAN between your web nodes and your monitoring node is a matter of a support ticket or a quick config change.
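Before rolling collectd out to every node, verify that Carbon is actually reachable from a web node over that private interface. Here is a minimal sanity check in Python (a sketch: it assumes the Carbon plaintext listener on 10.10.0.55:2003 from the config above, and the metric name is a throwaway one invented just for this test):

#!/usr/bin/env python
# One-off sanity check: push a single hand-crafted datapoint to Carbon's
# plaintext protocol and confirm the TCP connection succeeds.
import socket
import time

CARBON_SERVER = '10.10.0.55'   # private IP of the Graphite/Carbon node
CARBON_PORT = 2003             # default plaintext listener

# Throwaway metric name for the test -- anything under your prefix works
line = 'coolvds.production.web-node-01.test.heartbeat 1 %d\n' % int(time.time())

sock = socket.socket()
sock.settimeout(5)
sock.connect((CARBON_SERVER, CARBON_PORT))
sock.sendall(line)
sock.close()
print 'Datapoint sent, check the Graphite tree for test.heartbeat'

If the datapoint shows up in the Graphite tree within a minute, the pipeline is wired correctly and you can point collectd at it.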

Step 2: Visualizing with Grafana

Graphite's built-in web interface is... functional, but ugly. The new player in town this year is Grafana. It connects to your Graphite data source and lets you build dashboards that actually make sense to management.

When setting up your graphs, focus on the rate of change (derivative) rather than raw counters. A raw counter of "total HTTP requests" is useless. A graph showing "requests per second" allows you to correlate traffic spikes with load.

Here is how you might graph network traffic to detect a DDoS or a heavy download when shipping raw counters (with StoreRates true, as in the collectd config above, the series already arrives as a rate and you can drop the nonNegativeDerivative/scaleToSeconds wrapping):

aliasByNode(scaleToSeconds(nonNegativeDerivative(coolvds.production.web-node-01.interface.eth0.if_octets.rx), 1), 4)
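Before you spend time on dashboard layout, confirm that the series exists and is updating. Graphite's render API can return any target as JSON, so a small script doubles as a poor man's alert check. A sketch, assuming the Graphite web app listens on port 80 of the monitoring node and using the same hypothetical metric path as above:

#!/usr/bin/env python
# Pull the last 10 minutes of a series from Graphite's render API as JSON
# and print the most recent non-null datapoint.
import json
import urllib2

GRAPHITE_URL = 'http://10.10.0.55'   # assumed location of the Graphite web app
TARGET = 'coolvds.production.web-node-01.interface.eth0.if_octets.rx'

url = '%s/render?target=%s&from=-10minutes&format=json' % (GRAPHITE_URL, TARGET)
data = json.load(urllib2.urlopen(url))

for series in data:
    # datapoints are [value, timestamp] pairs; None means no data for that interval
    points = [p for p in series['datapoints'] if p[0] is not None]
    if points:
        value, ts = points[-1]
        print '%s: %s at %s' % (series['target'], value, ts)
    else:
        print '%s: no data in the last 10 minutes' % series['target']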

Step 3: Watching the Database (MySQL/MariaDB)

System metrics are half the story. The other half is inside MySQL. Use the Percona Monitoring Plugins for Nagios/Cacti, or write a custom script to feed Graphite. You need to watch Threads_connected and Innodb_buffer_pool_wait_free.

Here is a quick Python snippet (compatible with Python 2.6/2.7) to grab the connection count and send it to Graphite plaintext protocol:

#!/usr/bin/env python
import MySQLdb
import time
import socket

CARBON_SERVER = '10.10.0.55'
CARBON_PORT = 2003

def get_connections():
    # Ask MySQL for the current number of connected threads
    db = MySQLdb.connect(host="localhost", user="monitor", passwd="secure_password")
    cursor = db.cursor()
    cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    row = cursor.fetchone()
    db.close()
    return int(row[1])

def send_msg(message):
    # Ship the metric line(s) to Carbon's plaintext protocol port
    sock = socket.socket()
    sock.connect((CARBON_SERVER, CARBON_PORT))
    sock.sendall(message)
    sock.close()

if __name__ == '__main__':
    count = get_connections()
    timestamp = int(time.time())
    lines = [
        'coolvds.production.db01.mysql.threads_connected %d %d' % (count, timestamp)
    ]
    message = '\n'.join(lines) + '\n'
    send_msg(message)
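Run that from cron every minute and you have a usable graph. The same pattern extends to any counter from SHOW GLOBAL STATUS; here is a hypothetical variant of the script above that also ships Innodb_buffer_pool_wait_free (same assumed credentials and Carbon address):

#!/usr/bin/env python
# Variant of the script above: collect several status variables in one pass
# and emit one Graphite plaintext line per metric.
import MySQLdb
import socket
import time

CARBON_SERVER = '10.10.0.55'
CARBON_PORT = 2003
WATCHED = ['Threads_connected', 'Innodb_buffer_pool_wait_free']

def get_status(names):
    db = MySQLdb.connect(host="localhost", user="monitor", passwd="secure_password")
    cursor = db.cursor()
    cursor.execute("SHOW GLOBAL STATUS")
    status = dict(cursor.fetchall())          # {variable_name: value_string}
    db.close()
    return dict((name, int(status[name])) for name in names)

def send_msg(message):
    sock = socket.socket()
    sock.connect((CARBON_SERVER, CARBON_PORT))
    sock.sendall(message)
    sock.close()

if __name__ == '__main__':
    timestamp = int(time.time())
    lines = ['coolvds.production.db01.mysql.%s %d %d' % (name.lower(), value, timestamp)
             for name, value in get_status(WATCHED).items()]
    send_msg('\n'.join(lines) + '\n')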

The Norwegian Context: Data Sovereignty and Latency

Why do we obsess over this in Norway? Two reasons: Latency and Legislation.

1. Latency to NIX (Norwegian Internet Exchange): If your monitoring server is in Virginia (AWS US-East) and your servers are in Oslo, your monitoring latency is 100ms+. You will miss micro-bursts. By hosting your monitoring stack on a CoolVDS instance in Oslo, you get <2ms latency to your local infrastructure. You see problems as they happen.

2. Datatilsynet (Data Inspectorate): With the recent revelations regarding NSA surveillance, many Norwegian companies are moving data back home. While system metrics aren't usually PII (Personally Identifiable Information), log data often is. Sending logs containing IP addresses to a US-based SaaS monitoring tool can put you in a grey area regarding the Data Protection Directive. Keep it local. Keep it safe.

Hardware Matters: The I/O Bottleneck

All the software tuning in the world won't save you if your underlying host is oversubscribed. In virtualized environments, "Steal Time" (displayed as %st in top) is the enemy. It means the hypervisor is making your VM wait for CPU cycles.

Check your steal time right now:

vmstat 1 5

Look at the very last column (st). If it's consistently above 5-10%, your provider is squeezing you. Move to a provider that respects resource dedication.
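The collectd cpu plugin already graphs steal for you, but if you want a hard number right now rather than a column to eyeball, it can be read straight out of /proc/stat. A minimal sketch (assumes a reasonably modern Linux kernel that exposes the steal field):

#!/usr/bin/env python
# Spot-check CPU steal: sample the aggregate 'cpu' line in /proc/stat twice
# and report what share of the elapsed jiffies the hypervisor stole.
import time

def cpu_times():
    with open('/proc/stat') as f:
        # Fields after 'cpu': user nice system idle iowait irq softirq steal ...
        return [int(v) for v in f.readline().split()[1:]]

before = cpu_times()
time.sleep(1)
after = cpu_times()

deltas = [a - b for a, b in zip(after, before)]
total = sum(deltas)
steal_pct = 100.0 * deltas[7] / total if total else 0.0
print 'Steal over the last second: %.1f%%' % steal_pct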

Metric                     Acceptable Range    Danger Zone
Load Average (per core)    0.0 - 0.7           > 1.0
Disk I/O Wait (wa)         < 5%                > 20%
Steal Time (st)            0%                  > 5%

Final Thoughts

Monitoring is not an afterthought; it is a feature. In 2014, we have the tools to visualize our infrastructure in real-time, but it requires a shift in mindset from "fixing it when it breaks" to "fixing it when the graph looks weird."

Don't let a slow disk queue kill your reputation. Build your monitoring stack on a solid foundation. Deploy a KVM instance on CoolVDS today and see what your infrastructure is actually doing.