Stop Guessing: A Sysadmin’s Guide to Monitoring High-Load Infrastructure in 2013

It is 3:00 AM on a Tuesday. Your phone buzzes. It’s not a text from a friend; it’s PagerDuty. Your primary database node has locked up. Again. If you are reading this, you know the feeling of dread that settles in the pit of your stomach as you stumble to your laptop, SSH into a sluggish server, and pray `top` loads before the timeout.

In the wake of the recent PRISM leaks, we are all taking a harder look at where our data lives and who controls the metal it runs on. But sovereignty means nothing if the server is down. In the Nordic hosting market, specifically here in Norway, we deal with a unique set of constraints: users who expect low latency, strict adherence to the Personal Data Act (Personopplysningsloven), and the need for rock-solid stability.

Most VPS providers lie to you about dedicated resources. They oversell CPU cycles and assume you won't notice until your steal time (`st`) hits 20%. I’ve seen production stacks melt down not because of bad code, but because a "noisy neighbor" on the same physical host decided to recompile a kernel. Here is how we fix that.
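Step zero: verify the steal problem yourself before you file a ticket. A minimal sketch, assuming the procps vmstat where steal (`st`) is the last column:

# Sample CPU steal once a second for ten seconds
# (assumes procps vmstat, where st is the last column)
vmstat 1 10 | tail -n +3 | awk '{print "steal%: " $NF}'

Anything consistently above a couple of percent means you are sharing your core with someone hungrier.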

The Philosophy: If You Can't Graph It, It Doesn't Exist

There is a dangerous trend in 2013 to rely on "cloud" dashboards provided by vendors. These are often delayed by 5 to 15 minutes. In a high-traffic scenario—say, a Magento store running a flash sale—15 minutes is an eternity.

You need granular control. For serious infrastructure, we are looking at two main contenders: Nagios for alerting (is it up or down?) and Zabbix 2.0 for trending (is it getting slower?).

1. The Watchdog: Configuring Nagios for Instant Alerts

Nagios Core is the industry standard for a reason. It is ugly, it is text-based, and it works. The mistake most admins make is relying on default checks. Defaults are for hobbyists.

When I deploy a monitoring node, I need to know if my web server (Nginx or Apache) is struggling before it crashes. We use NRPE (Nagios Remote Plugin Executor) to run checks locally on the target server.

Here is a hardened nrpe.cfg snippet for checking load limits on a quad-core web node. The warning fires at full core saturation (4.0 on four cores), well before the critical threshold at 6.0:

# /usr/local/nagios/etc/nrpe.cfg

# LOAD CHECK
# Warn if any load avg (1/5/15 min) > 4.0, Crit if > 6.0
command[check_load]=/usr/local/nagios/libexec/check_load -w 4.0,4.0,4.0 -c 6.0,6.0,6.0

# DISK USAGE
# Critical if less than 10% free space on root
command[check_disk]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /

# ZOMBIE PROCS
# Catch runaway processes early
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z

Don't just check if port 80 is open. Check if the application is actually rendering. A simple TCP check returns OK even if your PHP-FPM pool is deadlocked. Use `check_http` with string matching to verify your footer loads.
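Something along these lines does the job; the hostname, URI, and match string here are hypothetical, so point it at a page that actually exercises PHP:

# nrpe.cfg -- full-stack check (hostname, URI, and string are placeholders)
command[check_app]=/usr/local/nagios/libexec/check_http -H shop.example.no -u /index.php -s "Footer 2013"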

2. The Silent Killer: Disk I/O Latency

In virtualized environments, Disk I/O is the scarcest resource. Traditional spinning HDDs (even SAS 15k) simply cannot handle the random read/write patterns of a modern MySQL database under load. This is where the battle is won or lost.

If you are seeing high iowait in `top`, your disk subsystem is choking. You can diagnose this with `iostat`.

$ iostat -x 1 10
Linux 2.6.32-358.el6.x86_64 (web01.coolvds.net) 	06/26/2013

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           12.40    0.00    3.10   45.20    0.00   39.30

Device:         rrqm/s   wrqm/s     r/s     w/s   svctm   %util
vda               0.00    12.00   45.50   102.00   8.50   98.20

See that 45.20% iowait? That server is useless. The CPU is sitting idle waiting for the disk to fetch data. In 90% of cases, this happens because you are on a host with slow storage or too many neighbors fighting for the same spindle.
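Before you blame the host, find out which of your own processes is hammering the disk. A quick sketch, assuming the sysstat package is installed:

# Per-process disk I/O, one-second samples, ten rounds (needs sysstat)
pidstat -d 1 10

If nothing on your side is writing and %util stays pegged, the problem lives below your VM.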

Pro Tip: Moving to SSD-based storage is not a luxury anymore; it is a requirement for database workloads. CoolVDS uses Pure SSD RAID-10 arrays. In our benchmarks, this reduces I/O wait from ~40% to under 1% for the same MySQL workload compared to standard SATA VPS providers.
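Trust, but verify; that applies to our numbers too. A crude sequential write test with dd (bypassing the page cache via oflag=direct) exposes slow storage in seconds, though for random I/O closer to a database workload fio is the better tool:

# Rough sequential write benchmark -- run on a quiet box, not mid-incident
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct
rm -f /tmp/ddtest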

3. Database Performance Tuning

Monitoring the OS is half the battle. You must monitor the engine. For MySQL 5.5 (or the newer 5.6), the InnoDB buffer pool is critical. You want your dataset to fit in memory to avoid hitting that disk we just talked about.

Use this query to check your buffer pool hit rate. If it's below 99%, you need more RAM or better configuration:

-- Status counters live as name/value rows in GLOBAL_STATUS, hence the self-join
SELECT
  (rq.VARIABLE_VALUE - rd.VARIABLE_VALUE) / rq.VARIABLE_VALUE * 100
  AS Buffer_Pool_Hit_Rate
FROM information_schema.GLOBAL_STATUS rd,
     information_schema.GLOBAL_STATUS rq
WHERE rd.VARIABLE_NAME = 'Innodb_buffer_pool_reads'
  AND rq.VARIABLE_NAME = 'Innodb_buffer_pool_read_requests';

If you need to adjust it, edit your `my.cnf`. Be careful not to allocate more than 70-80% of total RAM on a dedicated DB node, or the OS will start swapping. One more trap: on MySQL 5.5, changing innodb_log_file_size requires a clean shutdown and removal of the old ib_logfile* files before restart (5.6 resizes them automatically).

# /etc/my.cnf
[mysqld]
# Roughly 70% of RAM on a dedicated node (this example assumes a ~6GB box)
innodb_buffer_pool_size = 4G
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 2 # Trade slight ACID compliance for massive speed
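Is 4G enough? Measure the working set instead of guessing. A minimal sketch, assuming the mysql client can authenticate via your ~/.my.cnf:

# Total InnoDB data + index size in GB -- compare against innodb_buffer_pool_size
mysql -e "SELECT ROUND(SUM(data_length + index_length)/1024/1024/1024, 2) AS innodb_gb FROM information_schema.TABLES WHERE engine = 'InnoDB';"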

Why Architecture Matters: The CoolVDS Approach

We built CoolVDS because we were tired of "black box" hosting. When you run a `traceroute` from Oslo, you want to see your packets hit the NIX (Norwegian Internet Exchange) and drop straight into your data center. You don't want them routed through Frankfurt or London unnecessarily.
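Again: verify, don't assume. mtr shows every hop and its latency in one report (the target address below is a placeholder; use your own VPS IP):

# Per-hop route and latency report; -n skips reverse DNS
mtr --report --report-cycles 20 -n 192.0.2.10

If you see hops in Frankfurt or London between Oslo and your server, ask your provider why.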

We use KVM (Kernel-based Virtual Machine) virtualization exclusively. Unlike OpenVZ, KVM provides true hardware virtualization. This means:

  • Kernel Customization: You can install your own kernel modules (essential for specialized VPNs or file systems; see the sketch after this list).
  • Resource Isolation: Your RAM is your RAM. It is hard-reserved.
  • Security: Better isolation between tenants—critical for compliance with Datatilsynet regulations regarding data privacy.
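As a concrete example of that kernel freedom: on a KVM guest you can load the TUN module OpenVPN depends on yourself, something an OpenVZ container cannot do unless the host operator enables it:

# Load the TUN module and confirm the device node exists (run as root)
modprobe tun && ls -l /dev/net/tun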

For high-performance clusters, we recommend a split architecture:

Role            Recommended Specs (2013 Standard)   CoolVDS Solution
-------------   ---------------------------------   ----------------------
Load Balancer   1 vCPU, 512MB RAM                   Nginx Reverse Proxy
App Server      2 vCPU, 4GB RAM                     High-Freq Compute
Database        4 vCPU, 8GB+ RAM, SSD               SSD Optimized Instance
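On the load balancer tier, a minimal Nginx reverse proxy configuration looks roughly like this (the upstream hostnames are hypothetical):

# /etc/nginx/conf.d/lb.conf -- minimal sketch; app hostnames are placeholders
upstream app_pool {
    server app01.internal:8080;
    server app02.internal:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_pool;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}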

Bash Scripting for Quick Checks

Sometimes, Zabbix is too slow. If you are debugging a live incident, you need raw data. I keep this simple loop running in a tmux session during migrations or stress tests:

#!/bin/bash
# fast_monitor.sh
# Quick polling for load, memory, and HTTP connection count

while true; do
    echo "--- $(date) ---"
    # Field positions in uptime output shift as uptime grows, so split on the label
    uptime | awk -F'load average: ' '{print "Load: " $2}'
    free -m | awk '/^Mem:/ {print "RAM Used: " $3 "MB / " $2 "MB"}'
    # Count TCP connections on port 80 (':80 ' avoids matching 8080, 8000, ...)
    echo "Active HTTP Conns: $(netstat -ant | grep -c ':80 ')"
    sleep 5
done

Final Thoughts: Trust, But Verify

In our line of work, paranoia is a virtue. The recent NSA revelations have taught us that data sovereignty is not just a legal checkbox; it is a fundamental component of trust. Hosting your infrastructure within Norway, on hardware you can monitor down to the block device, is the only way to guarantee performance and privacy.

You can spend hours tweaking `sysctl.conf` and optimizing Nginx buffers, but if your underlying host is oversold, you are building a castle on sand.

Don't let slow I/O kill your uptime. Deploy a KVM test instance on CoolVDS in 55 seconds and see the difference real hardware isolation makes.