Stop Pinging, Start Measuring: The Reality of Infrastructure Monitoring at Scale
It is 3:00 AM on a Tuesday. Your phone buzzes. It’s Nagios. Again. "CRITICAL: Load average is 15.2." You ssh in, half-awake, only to find the server idling comfortably. The load spike lasted four seconds—just enough to trigger the check, not enough to matter. You close the laptop, but you don't sleep. You know the next alert is coming.
This is the "Ping Era" of monitoring, and frankly, it is killing our productivity. In the Nordic hosting market, where we pride ourselves on engineering precision, we need to stop asking "Is the server up?" and start asking "How is the server behaving?"
I have spent the last decade debugging high-traffic LAMP stacks from Oslo to Bergen. I have seen servers melt not because of traffic, but because of bad monitoring architectures that hid the real problems. Today, we are going deep into system metrics, the lies of "noisy neighbor" virtualization, and how to build a monitoring stack using Graphite and Collectd that actually scales.
The Silent Killer: CPU Steal Time
Most VPS providers lie to you. They oversell their physical cores by a factor of 4 or 5, banking on the fact that not everyone will use 100% CPU at once. When everyone does hit the CPU (like during a backup window), your performance tanks, but your internal monitoring shows normal usage.
Enter CPU Steal Time (%st). This is the percentage of time your virtual CPU was ready to run a process but the hypervisor (the physical host) pulled the plug to serve another customer.
Pro Tip: If your %st is consistently above 5-10%, your provider is overselling. Move to a provider that uses strict KVM resource isolation, like CoolVDS, where a core is a core.
Here is how you catch a bad host red-handed using mpstat (part of the sysstat package on CentOS 6 and Debian 7):
# Install sysstat if you haven't already (on Debian 7: apt-get install sysstat)
yum install sysstat -y
# Check CPU stats every 1 second, 5 times
mpstat 1 5
Output analysis (this is what a healthy host looks like):
06:38:05 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
06:38:06 PM all 12.50 0.00 2.20 4.30 0.00 0.10 0.00 0.00 80.90
If that %steal column shows double digits, no amount of Nginx tuning will save you. You are fighting for scraps of silicon.
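You do not want to run this check by hand at 3:00 AM, so automate it. Below is a minimal sketch of a steal-time watchdog, assuming the sysstat column layout shown above and a hypothetical 10% threshold; wire its exit code into whatever alerting you already run.
#!/bin/bash
# steal_check.sh - warn when CPU steal time crosses a threshold
# Field positions assume the mpstat layout shown above (%steal is third from last)
THRESHOLD=10
STEAL=$(mpstat 1 5 | awk '/^Average/ {print $(NF-2)}')
# Compare as floating point via awk, since bash arithmetic is integer-only
if awk -v s="$STEAL" -v t="$THRESHOLD" 'BEGIN {exit !(s >= t)}'; then
    echo "WARNING: CPU steal is ${STEAL}% - the hypervisor is starving this guest"
    exit 1
fi
echo "OK: CPU steal is ${STEAL}%"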
I/O Wait: The Bottleneck of 2013
We are currently seeing a massive shift in storage. Spinning rust (HDDs) cannot keep up with modern database demands. I recently migrated a Magento installation that was choking on MySQL queries. The CPUs were idle, but the site took 6 seconds to load.
The culprit? I/O Wait. The processor was sitting around waiting for the disk to write data. We moved the instance to a CoolVDS slice backed by Enterprise SSDs (using raw PCIe pass-through technology where possible), and load times dropped to 400ms.
To monitor this, rely on iostat. Watch the await column specifically:
iostat -x 1
avg-cpu: %user %nice %system %iowait %steal %idle
5.20 0.00 1.10 25.40 0.00 68.30
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
vda 0.00 4.00 0.00 15.00 0.00 152.00 10.13 0.80 53.20 4.10 6.15
An await of 53ms on a database server is unacceptable. You want single digits. This is why we are aggressively rolling out high-speed SSD arrays and NVMe-class storage across our Norwegian datacenters. Speed is not a luxury; it is a requirement for SEO.
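Once await tells you the disk is the problem, the next question is which process is hammering it. pidstat (also part of the sysstat package) breaks I/O down per process; on stock CentOS 6 kernels the per-task I/O accounting it needs is already enabled.
# Per-process disk I/O, sampled every second
# Typical columns: PID, kB_rd/s, kB_wr/s, kB_ccwr/s, Command
pidstat -d 1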
The New Stack: Graphite + Collectd
Nagios is great for "Is it down?", but it is terrible for "Is it getting slower?". For trends, you need a time-series solution. In 2013, the power combo is Collectd (to gather metrics) and Graphite (to render them).
Unlike RRDTool/Munin, Graphite scales to thousands of metrics per second. You can correlate "Deployment Time" with "Apache Latency" instantly.
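Wiring collectd into Graphite takes one plugin. Since collectd 5.1 there is a write_graphite plugin that speaks Carbon's plaintext protocol directly; here is a minimal sketch, assuming Carbon listens on its default port 2003 and using the same internal hostname as the script further down. Note that 5.1/5.2 call the inner block <Carbon>, while newer releases renamed it to <Node "...">, so check your version's documentation.
LoadPlugin write_graphite
<Plugin write_graphite>
  <Carbon>
    Host "graphite.coolvds.internal"
    Port "2003"
    Prefix "collectd."
  </Carbon>
</Plugin>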
Configuring Collectd for MySQL Monitoring
Don't just monitor if MySQL is running. Monitor the InnoDB buffer pool. Here is a snippet for /etc/collectd.conf on a CentOS 6 box (the EPEL package puts the config at the top of /etc, not under /etc/collectd/):
LoadPlugin mysql
<Plugin mysql>
  <Database "coolvds_production">
    Host "localhost"
    User "monitoring"
    Password "SuperSecret2013!"
    Socket "/var/lib/mysql/mysql.sock"
    MasterStats true
  </Database>
</Plugin>
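The snippet assumes a "monitoring" MySQL account already exists. The ordinary counters come from SHOW STATUS, which any account can read, but MasterStats runs SHOW MASTER STATUS, and that requires the REPLICATION CLIENT privilege. A minimal grant looks like this (the password is just the placeholder from the snippet above):
# Create a low-privilege monitoring account for collectd
mysql -u root -p <<'SQL'
GRANT USAGE, REPLICATION CLIENT ON *.*
  TO 'monitoring'@'localhost' IDENTIFIED BY 'SuperSecret2013!';
SQL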
Pair the collectd setup with a custom script to feed application metrics directly to Carbon (Graphite's listener) via netcat. It is brutally simple and effective:
#!/bin/bash
# send_metric.sh
# Usage: ./send_metric.sh my.metric.name 42
METRIC=$1
VALUE=$2
TIMESTAMP=$(date +%s)
echo "$METRIC $VALUE $TIMESTAMP" | nc -q0 graphite.coolvds.internal 2003
Data Sovereignty and The NIX Advantage
We must address the elephant in the room: Data Privacy. With the recent news regarding PRISM and the NSA, hosting data outside of Europe has become a liability. The Norwegian Personal Data Act (Personopplysningsloven) sets strict standards for how we handle customer data.
By hosting on CoolVDS, your data sits in Oslo. We peer directly at NIX (Norwegian Internet Exchange). This serves two purposes:
- Compliance: Your data remains under Norwegian jurisdiction, safe from the overreach of the US Patriot Act.
- Latency: If your customers are in Trondheim or Stavanger, routing traffic through Frankfurt or London is inefficient. Direct peering drops latency from 40ms to 4ms.
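Don't take latency figures on faith; measure them from where your users sit. mtr shows per-hop latency and makes it obvious when packets detour through Frankfurt (the target hostname below is just a placeholder):
yum install mtr -y
mtr --report --report-cycles 10 vps.example.no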
Architecting for Failure
Hardware fails. It is a fact of life. In a traditional dedicated server environment, a motherboard failure means 4 hours of downtime while a technician swaps parts. In our KVM cloud, we use distributed storage.
If the physical node hosting your VPS detects a fault, we can migrate your instance to a healthy node. However, you must configure your OS to handle this graceful reboot. Ensure your services start on boot.
# The old way (SysVinit)
chkconfig httpd on
chkconfig mysqld on
# The new way (if you are testing Ubuntu 13.04 or relying on Upstart)
# Check /etc/init/mysql.conf configuration
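Trust, but verify. After enabling the services, confirm they are registered for the runlevels you actually boot into (3 for headless servers), and on Upstart systems make sure no override file has disabled the job:
# SysVinit: httpd and mysqld should show "on" for runlevels 3 and 5
chkconfig --list httpd
chkconfig --list mysqld
# Upstart: a job is disabled only if its override says "manual"
cat /etc/init/mysql.override 2>/dev/null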
Conclusion
Monitoring is not about buying expensive software suites; it is about visibility. It is about knowing that your disk I/O is saturating before your customers complain about checkout times. It is about understanding that "Virtual CPU" is meaningless if your provider allows noisy neighbors to steal your cycles.
We built CoolVDS because we were tired of debugging other people's infrastructure. We offer pure KVM virtualization, industry-leading low latency connectivity in Norway, and storage arrays that don't choke under load.
Don't let slow I/O kill your SEO or your sleep. Deploy a test instance on CoolVDS today and see what VPS Norway performance should actually look like.