The Watcher That Watches: When Your Monitoring Stack Becomes the Bottleneck
It’s 3:00 AM. Your pager is silent. Is the infrastructure stable, or has the monitoring server crashed under its own weight? If you are running a default Nagios or Zabbix installation against more than 500 nodes, it’s likely the latter. Everyone talks about scaling the web application, but few discuss the crushing I/O load generated by the monitoring systems themselves.
In the wake of recent leaks regarding data surveillance (the PRISM news breaking this week is a wake-up call for us all), where you host your infrastructure data is just as critical as how you monitor it. As a System Administrator in the Nordic market, I’ve seen robust architectures crumble not because the web servers failed, but because the monitoring node hit 100% I/O wait and failed to alert on a disk failure.
Let’s stop clicking around GUIs. We are going to look at the raw configuration required to scale monitoring on Linux, manage disk latency, and why the underlying virtualization technology determines whether you sleep at night.
The Silent Killer: Disk I/O Latency
Most SysAdmins throw CPU cores at a slow Zabbix server. This is a mistake. Monitoring systems are write-heavy. Zabbix, for instance, writes history data, trends, and events to the database continuously. On a standard rotating HDD (even a SAS 15k drive), the random write operations will cap out your IOPS long before the CPU sweats.
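The quickest way to see where that write load lands is to ask the database which tables dominate. A small sketch, assuming the default schema name zabbix and credentials picked up from ~/.my.cnf:

# Top five largest tables -- history* and trends* will almost always lead
$ mysql <<'SQL'
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM   information_schema.TABLES
WHERE  table_schema = 'zabbix'
ORDER  BY data_length + index_length DESC
LIMIT  5;
SQL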
Check your iostat. If your %iowait is consistently above 5-10%, your monitoring is lagging.
$ iostat -x 1 10
avg-cpu: %user %nice %system %iowait %steal %idle
5.20 0.00 2.10 25.40 0.00 67.30
Device: rrqm/s wrqm/s r/s w/s svctm %util
sda 0.00 12.00 4.00 150.00 6.50 98.00
See that 25.40 %iowait? Your kernel is waiting for the disk. This is where CoolVDS differs from the budget providers. We prioritize SSD storage arrays for our KVM instances. In 2013, running a high-frequency polling system on magnetic storage is negligence. You need the random write performance of Solid State Drives to keep the queue length down.
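Don't take a datasheet's word for it, either. Before you point the database at a volume, measure its random-write behaviour with fio. A rough sketch (the target path and the 16k block size, chosen to match the InnoDB page size, are assumptions; aim it at the volume that will actually hold the data):

# 60-second random-write test; delete the test file afterwards
$ fio --name=db-randwrite --filename=/var/lib/mysql/fio.test --size=1G \
      --rw=randwrite --bs=16k --direct=1 --ioengine=libaio --iodepth=16 \
      --runtime=60 --time_based --group_reporting

A 15k SAS spindle will struggle to reach a few hundred IOPS on this test; even a modest SSD lands one to two orders of magnitude higher.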
Tuning MySQL for Zabbix 2.0
With Zabbix 2.0 becoming the standard this year, the database backend is the heart of the operation. The default my.cnf on CentOS 6 or Debian Wheezy is garbage for this workload. It is tuned for small websites, not millions of metrics.
The most controversial but necessary change involves data safety versus speed. For a monitoring server, strict ACID compliance on every transaction might be overkill if it kills performance.
Here is the my.cnf configuration I deployed last week for a client monitoring 1,200 VMs:
[mysqld]
# Allocate 70-80% of RAM to the buffer pool on a dedicated box
innodb_buffer_pool_size = 4G
# The controversial flag.
# 0 = Write to disk once per second.
# 1 = Write to disk on every commit (Safest, Slowest).
# 2 = Write to OS cache on commit, flush to disk every second.
innodb_flush_log_at_trx_commit = 2
# Keep data separate to avoid massive ibdata1 files
innodb_file_per_table = 1
# Larger redo logs smooth out checkpoint flushing on write-heavy workloads
innodb_log_file_size = 256M
innodb_log_buffer_size = 8M
# Optimize for SSDs (if you are on CoolVDS)
innodb_io_capacity = 2000
Pro Tip: Setting innodb_flush_log_at_trx_commit = 2 means you might lose 1 second of monitoring data if the OS crashes. For 99% of use cases, this trade-off is worth the massive performance gain.
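You can flip the flush behaviour at runtime to measure the effect before baking it into my.cnf. The redo log size is another matter: the MySQL 5.5 builds on CentOS 6 and Wheezy refuse to start InnoDB if the existing ib_logfile* files don't match the new innodb_log_file_size, so move them aside after a clean shutdown. A sketch, assuming the default datadir and a CentOS-style service name:

# Apply the flush behaviour on the fly and verify it
$ mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 2"
$ mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_flush_log_at_trx_commit'"

# innodb_log_file_size only takes effect once the old redo logs are out of the way
$ service mysqld stop
$ mv /var/lib/mysql/ib_logfile0 /var/lib/mysql/ib_logfile1 /root/
$ service mysqld start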
Nagios: Fixing the Check Latency
If you prefer Nagios (or forks like Icinga), the bottleneck is usually fork overhead. Every active check forks a new process, and at scale that alone can saturate the CPU.
To mitigate this, tune nagios.cfg so checks are spawned cheaply and their results are reaped promptly. Don't let your service check latency drift above 10 seconds.
# nagios.cfg optimizations
# Trim per-check overhead (skips environment macros and the extra fork)
use_large_installation_tweaks=1
# Reap queued check results every 2 seconds so they don't pile up
check_result_reaper_frequency=2
# Flag checks that were scheduled but never returned a result
check_for_orphaned_services=1
check_for_orphaned_hosts=1
# Service check timeout - kill them fast if they hang
service_check_timeout=30
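Verify the effect instead of guessing: nagiostats ships with the daemon and reports the latency the scheduler itself is seeing (the config path below is the CentOS default; Debian's nagios3 package keeps it under /etc/nagios3/):

# Reports min / max / average latency for active checks
$ nagiostats -c /etc/nagios/nagios.cfg | grep -i latency

If the average creeps past that 10-second mark, the reaper settings (or the disk underneath) still need work.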
The KVM Advantage
Why do we insist on KVM (Kernel-based Virtual Machine) at CoolVDS? Because in an OpenVZ or containerized environment (which many budget hosts use to oversell resources), you share the kernel with neighbors. If a neighbor gets DDoS'd, your iowait spikes, and your monitoring generates false positives.
With KVM, you have a dedicated kernel and reserved RAM. Your monitoring system must be isolated. False alerts destroy trust in the system. When your phone buzzes, it should be real.
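Not sure what your current "VPS" really is underneath? It takes ten seconds to check. virt-what is packaged by the major distributions, and OpenVZ containers give themselves away through the beancounters file:

# Prints "kvm" on a KVM guest
$ virt-what
# OpenVZ containers expose their resource counters here
$ test -e /proc/user_beancounters && echo "OpenVZ container" || echo "not OpenVZ"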
Data Sovereignty: The Norwegian Context
We need to address the elephant in the room. The geopolitical landscape of hosting changed this week. Data passing through US-owned servers or networks is subject to scrutiny that many European businesses find unacceptable.
Under the Norwegian Personal Data Act (Personopplysningsloven), you have a responsibility to secure sensitive infrastructure data. Monitoring data often reveals architecture diagrams, patch levels, and vulnerable endpoints.
| Factor | International Cloud | CoolVDS (Norway) |
|---|---|---|
| Latency to NIX (Oslo) | 25-40ms | < 2ms |
| Jurisdiction | US / EU Mix | Norway (Datatilsynet) |
| Resource Dedication | Shared/Burstable | Dedicated KVM Resources |
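Jurisdiction aside, the monitoring plane itself should never face the open internet. A minimal iptables sketch for the Zabbix ports (10050/10051 are the defaults for agent and trapper traffic; the server address 192.0.2.10 and the management subnet 10.0.0.0/24 are placeholders):

# On the agents: only the Zabbix server may poll port 10050
$ iptables -A INPUT -p tcp --dport 10050 -s 192.0.2.10 -j ACCEPT
$ iptables -A INPUT -p tcp --dport 10050 -j DROP
# On the server: trapper port reachable only from the management subnet
$ iptables -A INPUT -p tcp --dport 10051 -s 10.0.0.0/24 -j ACCEPT
$ iptables -A INPUT -p tcp --dport 10051 -j DROP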
Verifying Connectivity from the Shell
Before you deploy, verify your connectivity to the Nordic backbone. High latency in monitoring leads to gaps in graphs (the dreaded "broken lines" in Graphite/Carbon). Use MTR (My Traceroute) to check the path quality from your terminal:
$ mtr --report --cycles 10 nix.no
HOST: monitoring-node-01 Loss% Snt Last Avg Best Wrst StDev
1.|-- gw.coolvds.net 0.0% 10 0.4 0.4 0.3 0.5 0.1
2.|-- nix-gw.oslo.no 0.0% 10 1.2 1.1 1.0 1.3 0.1
3.|-- core.nix.no 0.0% 10 1.5 1.5 1.4 1.6 0.0
If you aren't seeing single-digit latency to the Oslo exchange, your "local" hosting might actually be routing through Frankfurt or London.
Conclusion
Infrastructure monitoring is not a "set and forget" task. It requires tuning the database, understanding the OS I/O scheduler, and choosing the right underlying hardware. Don't let magnetic disks and shared kernels turn your monitoring into the blind spot that hides what is actually happening in your infrastructure.
Ensure your data stays within Norwegian borders and your IOPS belong only to you. Deploy a high-performance SSD KVM instance on CoolVDS today and see what you've been missing.