Scaling Infrastructure Monitoring: Surviving the I/O Bottleneck with Graphite and KVM

Stop Letting False Positives Wake You Up at 3 AM

It’s 03:14 on a Tuesday. Your pager goes off. Nagios says the main MySQL cluster in the Oslo data center is down. You scramble to your terminal, sweat forming, and SSH in. Uptime is perfect. Load average is 0.5. The database is humming. So what happened?

Your monitoring server choked.

In the last year, I’ve seen this scenario play out in half the startups deploying across the Nordics. We are collecting more metrics than ever—CPU steal, disk queues, Memcached hits—but we are dumping them into antiquated architectures that rely on synchronous polling and heavy disk I/O. When you try to write 50,000 RRD (Round Robin Database) files simultaneously on a standard spinning disk or a cheap oversold VPS, the disk queue spikes, the check times out, and you get a false alarm.

If you are serious about infrastructure in 2013, you need to stop polling and start pushing. And you need hardware that doesn't lie to you about IOPS.

The Bottleneck is Disk I/O, Not CPU

Most legacy setups use Cacti or Nagios with PNP4Nagios. These tools rely heavily on RRDtool. Every time a metric comes in, RRDtool opens a file, seeks, reads, modifies, and writes. Do this for 500 servers with 20 metrics each, every 60 seconds, and you are looking at 10,000 random writes a minute. That is random write hell.

To diagnose if this is your problem, run `iostat` on your monitoring box. If your `%util` is hovering near 100% while CPU is idle, your disk subsystem is the bottleneck.

$ iostat -x 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.50    0.00    1.50   45.00    0.00   51.00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    15.00    5.00  120.00    40.00  1500.00    12.32     2.50   15.00   8.00  98.50

See that `%util` at 98.50? Your monitoring server is practically dead. It cannot write the data fast enough. This is why at CoolVDS, for our internal monitoring, we strictly use KVM instances backed by enterprise-grade SSD arrays. The random write performance of SSDs crushes the 15k SAS drives many hosts still try to sell you.

The Architecture Shift: Graphite & Carbon

The solution gaining traction right now among heavy users like Etsy is Graphite. Unlike Nagios, which asks "Are you alive?", Graphite listens. Your servers send metrics to it (push), and the Carbon daemon caches them in RAM, flushing them to disk in batched writes instead of one seek per datapoint.

This decouples metric collection from metric storage.
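You can see the push model work from any shell. A quick sanity check against Carbon's plaintext listener (TCP port 2003 by default) looks like this; the IP is just the example Graphite box used later in this post, and the metric name is made up:

# Plaintext protocol: "<metric.path> <value> <unix_timestamp>"
echo "production.web01.test 42 $(date +%s)" | nc -w1 10.0.0.5 2003

If the point shows up in the Graphite web UI a few seconds later, the listener is doing its job.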

1. Configuring Storage Schemas

One common pitfall when setting up Graphite on CentOS 6 is the retention policy. The stock catch-all schema keeps only coarse, short-lived data (one-minute points for a single day), so anything finer is thrown away almost immediately. You need to configure `storage-schemas.conf` to keep high-resolution data for recent troubleshooting and lower resolution for historical trends.

# /opt/graphite/conf/storage-schemas.conf

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[production_servers]
pattern = ^production\.
# 10s resolution for 6 hours, 1min for 7 days, 10min for 5 years
retentions = 10s:6h,1m:7d,10m:5y

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d

This configuration is critical. If you don't define this correctly *before* you start sending data, resizing Whisper files later is a painful manual process involving `whisper-resize.py`.
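If you do end up changing retentions after the fact, the resize is per Whisper file. A rough sketch of that process, assuming the default storage path and the metric name used in the next section:

# Inspect the archives a Whisper file currently has
whisper-info.py /opt/graphite/storage/whisper/production/web01/loadavg_1min.wsp

# Rewrite it with the new retentions (back up first -- this has to be run for every file)
whisper-resize.py /opt/graphite/storage/whisper/production/web01/loadavg_1min.wsp 10s:6h 1m:7d 10m:5y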

2. Sending Metrics (The Python Way)

You don't need a heavy agent. You can send metrics using a simple Python script and the standard `socket` library. This is lightweight and adds negligible load to your production nodes.

import socket
import time

CARBON_SERVER = '10.0.0.5'   # your Graphite/Carbon host
CARBON_PORT = 2003           # Carbon's plaintext line receiver

# Plaintext protocol: "<metric.path> <value> <unix_timestamp>\n"
message = 'production.web01.loadavg_1min 0.45 %d\n' % int(time.time())

sock = socket.socket()
sock.connect((CARBON_SERVER, CARBON_PORT))
sock.sendall(message)
sock.close()
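To feed this continuously, the simplest approach is cron; one run per minute is plenty for load averages. The script path and filename below are placeholders, not anything Graphite ships:

# /etc/cron.d/graphite-metrics
* * * * * root /usr/bin/python /usr/local/bin/push_loadavg.py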

Hardware Reality: The "Noisy Neighbor" Problem

This brings us back to the infrastructure. You can have the most optimized Graphite config in the world, but if you are hosting it on an OpenVZ container oversold by a budget provider, you will suffer from "CPU Steal".

OpenVZ shares the host kernel. If your neighbor on the physical node decides to compile a kernel or run a heavy backup script, your processes queue for CPU, steal time climbs, and your monitoring lags. In 2013, with the rise of real-time bidding and high-frequency trading apps here in Oslo, you cannot afford that jitter.

Pro Tip: Always check `/proc/user_beancounters` if you are on OpenVZ to see if you are hitting limits. Or better yet, switch to KVM. On CoolVDS KVM instances, resources are hard-allocated. Your RAM is yours. Your CPU cycles are yours. The kernel is yours.
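Two quick checks that expose the problem, assuming a stock CentOS toolchain (the awk simply skips the two header lines of the beancounters file):

# OpenVZ: any non-zero value in the final failcnt column means the host has denied you resources
awk 'NR > 2 && $NF > 0' /proc/user_beancounters

# Any platform: watch the "st" (steal) column -- sustained non-zero values point to a noisy neighbor
vmstat 1 5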

Data Sovereignty and Datatilsynet

We also need to address the legal elephant in the room. Under the Norwegian Personal Data Act (Personopplysningsloven), you are responsible for where your data lives. If your monitoring system logs IP addresses or usernames, that is PII (Personally Identifiable Information).

Hosting this data outside the EEA (European Economic Area) introduces complex legal hurdles regarding Safe Harbor compliance. By keeping your monitoring infrastructure on servers physically located in Norway—hooked directly into NIX (Norwegian Internet Exchange)—you reduce latency and simplify compliance with Datatilsynet regulations. Latency from Oslo to a server in Frankfurt might only be 20ms, but latency to a server in Oslo is 1ms. When you are aggregating 10,000 metrics a second, that network overhead adds up.

Optimizing the Network Stack

If you are pushing thousands of metrics to a CoolVDS instance, the default Linux network stack might drop packets. Tune your `sysctl.conf` to handle the influx.

# /etc/sysctl.conf optimizations for high throughput

# Increase the maximum number of open files
fs.file-max = 65535

# Increase the backlog of incoming connections
net.core.somaxconn = 1024

# Widen the port range for outgoing connections (if you are the sender)
net.ipv4.ip_local_port_range = 1024 65000

# Fast recycling of TIME_WAIT sockets (use with caution behind NAT)
net.ipv4.tcp_tw_recycle = 1

Apply these with `sysctl -p`. These settings allow the Carbon daemon to accept more simultaneous connections without dropping data.
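It is worth confirming the values took effect and keeping an eye on the symptom they address. The exact wording of the netstat counters varies by kernel, but something along these lines works:

# Confirm the live values
sysctl net.core.somaxconn net.ipv4.ip_local_port_range

# Listen-queue overflows mean Carbon is still refusing connections under load
netstat -s | grep -i "listen queue"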

The Verdict

Monitoring is not just about installing Nagios and forgetting it. It is an active architectural choice. As we move deeper into 2013, the volume of data is only going to increase. Polling is dying. SSDs are becoming mandatory. And the stability of your virtualization platform is the foundation of it all.

If you are tired of debugging I/O wait instead of shipping code, it is time to upgrade.

Don't let slow I/O kill your uptime metrics. Deploy a high-performance, KVM-based Graphite stack on CoolVDS today and see what your infrastructure is actually doing.