Scaling Infrastructure Monitoring in 2013: Why I/O Latency is Killing Your Zabbix Server

It is 3:00 AM. Your pager (or if you are lucky, your SMS gateway) just went off. The alert says the main database cluster is down. You scramble to the terminal, SSH in, and... everything is fine. The load is 0.5. The connectivity is perfect. The site is up.

Your monitoring system lied to you. Again.

In the last six months, working with high-traffic e-commerce platforms in Oslo, I’ve seen this pattern repeat ad nauseam. Sysadmins deploy Nagios or Zabbix on a cheap, oversold VPS, throw 500 hosts at it, and then wonder why they get false positives. The culprit is almost never the network: it’s the disk subsystem on the monitoring node choking on write operations.

The "I/O Wait" Silent Killer

Monitoring systems are database-heavy beasts. Zabbix, specifically, writes historical data and trends to MySQL (or PostgreSQL) constantly. If you are monitoring 200 servers at a 30-second interval with 50 items each, you are generating over 300 inserts per second. On a standard hosting node where resources are oversold (typical OpenVZ setups), your `%steal` goes up and your I/O wait (`wa`) spikes.
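The arithmetic is worth doing for your own fleet before you blame anything else. A back-of-the-envelope NVPS (new values per second) estimate for the numbers above:

```shell
# NVPS estimate: 200 hosts, 50 items each, polled every 30 seconds.
# Swap in your own host/item counts and interval.
nvps=$(awk 'BEGIN { printf "%.0f", 200 * 50 / 30 }')
echo "${nvps} values/sec"
```

Every one of those values is at least one row inserted into the history tables, plus trend aggregation on top.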

When the disk queue fills up, the Zabbix poller processes stall. If they stall long enough, they mark hosts as unreachable. This is not a software bug; it is an architecture failure.

Let's look at what is actually happening under the hood. Run this on your monitoring box:

root@monitor01 [~]# iostat -x 1
Linux 2.6.32-358.el6.x86_64 (monitor01) 	08/19/2013 	_x86_64_	(4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.20    0.00    2.10   45.30    8.50   38.90

Device:         rrqm/s   wrqm/s     r/s     w/s   svctm   %util
vda               0.00    25.00    2.00  150.00    6.50   98.20

See that 45.30% iowait? See the 98.20% util? Your CPU is doing nothing but waiting for the hard drive platter to spin or the RAID controller to catch up. In a virtualized environment, this usually means the host node is overloaded by other tenants.

Pro Tip: If you see `%steal` above 5% consistently, your hosting provider is overselling CPU cycles. Move your workload immediately. This is why we insist on KVM virtualization at CoolVDS—hardware isolation matters.
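If you want a number you can script against rather than eyeballing `top`, steal time can be read straight from `/proc/stat`. A minimal sketch, assuming a kernel recent enough to report the steal field (2.6.11+, so any EL6 box qualifies):

```shell
# Measure %steal over a one-second window.
# The 8th numeric field on the "cpu" line of /proc/stat is steal time.
read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2 + q2 + sq2 + st2) \
        - (u1 + n1 + s1 + i1 + w1 + q1 + sq1 + st1) ))
steal=$(awk -v st=$(( st2 - st1 )) -v t="$total" \
        'BEGIN { printf "%.1f", 100 * st / t }')
echo "steal: ${steal}%"
```

Drop that in cron with a threshold and you have evidence to show your provider, instead of a vague complaint about "slowness".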

Tuning Zabbix 2.0 for High Performance

Before you blame the hardware completely, ensure your `zabbix_server.conf` isn't set to defaults. The default configuration in the EPEL repositories for CentOS 6 is meant for small setups, not production environments.

Here are the parameters we tuned for a client monitoring a NIX (Norwegian Internet Exchange) connected infrastructure:

### /etc/zabbix/zabbix_server.conf

# Do not let the pollers sit idle, but don't spawn too many causing context switching
StartPollers=50
StartPingers=10

# CRITICAL: Increase cache sizes to reduce DB hits
CacheSize=256M
HistoryCacheSize=128M
TrendCacheSize=128M

# Keep the housekeeper in check to prevent massive delete operations locking tables
MaxHousekeeperDelete=500
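Once those caches are raised, make Zabbix watch itself so you know when they fill up. The internal item keys below are the ones I reach for on a 2.0 server; double-check the exact syntax against the docs for your version before copying them into a template:

```
# Items of type "Zabbix internal" on the monitoring host itself:
zabbix[wcache,history,pfree]      # % free in the history write cache
zabbix[wcache,trend,pfree]        # % free in the trend cache
zabbix[queue]                     # items overdue for polling
zabbix[process,poller,avg,busy]   # average % of time pollers are busy
```

If poller busy time sits above ~75%, raise StartPollers; if the write cache free space trends toward zero, your database is the bottleneck, not the pollers.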

Optimizing the MySQL Backend

Zabbix on MySQL (InnoDB) requires a buffer pool large enough to hold the active working set. If you are swapping to disk, you are dead. Ensure your `my.cnf` is optimized for the RAM available on your VPS.

[mysqld]
# Set this to 70-80% of total available RAM
innodb_buffer_pool_size = 4G

# Separate table spaces are a must for reclaiming space later
innodb_file_per_table = 1

# Flush method O_DIRECT avoids double buffering in OS cache
innodb_flush_method = O_DIRECT

# Don't sync to disk on every commit if you can tolerate 1 sec data loss
# This massively improves performance for Zabbix history writes
innodb_flush_log_at_trx_commit = 2
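To verify the buffer pool is actually big enough, compare logical reads against reads that had to hit disk. The counter values below are hypothetical, just to show the calculation; on a live server pull the real numbers from `SHOW GLOBAL STATUS`:

```shell
# Hypothetical sample counters; on a real box fetch them with:
#   mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"
reads=12000          # Innodb_buffer_pool_reads: reads that missed the pool
requests=9800000     # Innodb_buffer_pool_read_requests: logical reads
ratio=$(awk -v r="$reads" -v q="$requests" \
        'BEGIN { printf "%.2f", (1 - r / q) * 100 }')
echo "buffer pool hit ratio: ${ratio}%"
```

Anything much below 99% on a dedicated Zabbix database means the working set is spilling to disk and the pool needs more RAM.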

The Topology of Trust: Local vs. Remote

Latency is physics. If your infrastructure is primarily in Oslo, hosting your monitoring server in a budget data center in Texas is asking for trouble. A hiccup in the transatlantic fiber can look like a total outage on your dashboard.

For Norwegian businesses, there is also the compliance aspect. The Personopplysningsloven (Personal Data Act) and the Datatilsynet guidelines suggest keeping sensitive log data within the EEA. While metric data (CPU load, RAM usage) is rarely PII, log monitoring often inadvertently captures IP addresses or user names. Keeping your monitoring stack on a VPS in Norway simplifies your legal compliance posture significantly.

Why "CoolVDS" Architecture Solves the False Positive Loop

We built CoolVDS because we got tired of the "noisy neighbor" effect destroying our own monitoring stacks. When you deploy a monitoring node, you need consistent I/O performance, not "burstable" performance that disappears when you need it most.

| Feature | Generic OpenVZ VPS | CoolVDS KVM Instance |
| --- | --- | --- |
| Virtualization | Shared kernel (OS-level) | Full hardware virtualization (KVM) |
| Storage backend | Standard HDD / shared caching | Pure SSD RAID-10 |
| Resource guarantees | Soft limits (oversold) | Dedicated RAM & CPU allocation |
| Swap behavior | Often fails or unavailable | Full control over swap partitions |

We utilize Solid State Drives (SSDs) in RAID-10 configurations. In 2013, SSDs are still considered a luxury by many budget hosts, but for database-driven applications like Zabbix or heavy MySQL workloads, spinning rust is obsolete. The random write performance (IOPS) of an SSD array allows Zabbix to insert thousands of metrics per second without the `iowait` spiking.

Deploying a Custom Check

To wrap this up, here is a quick win. Don't just monitor if a service is running; monitor its responsiveness. Here is a custom UserParameter to check local disk write speed directly from the agent, so you know if your host is degrading.

# /etc/zabbix/zabbix_agentd.conf
# WARNING: This creates a small file write every time it runs. Don't set interval < 5 min.
UserParameter=custom.disk.write_speed,dd if=/dev/zero of=/tmp/test.img bs=1M count=1 oflag=dsync 2>&1 | awk '/bytes/ {print $(NF-1)}'
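Before restarting the agent, run the same pipeline by hand and confirm it returns a bare number (the throughput figure from dd's summary line; the unit is usually MB/s, but check `$NF` if you care):

```shell
# Dry-run the UserParameter's command before reloading zabbix_agentd.
# LC_ALL=C keeps dd's summary line in a predictable format for awk.
speed=$(LC_ALL=C dd if=/dev/zero of=/tmp/test.img bs=1M count=1 oflag=dsync 2>&1 \
        | awk '/bytes/ {print $(NF-1)}')
rm -f /tmp/test.img
echo "local write speed: ${speed}"
```

Once it checks out, restart the agent and query the key remotely with zabbix_get to confirm the server sees the same value.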

Monitoring is the eyes and ears of your infrastructure. Don't let it go blind because of cheap storage. If you need a battle-tested environment with low latency to NIX and legitimate hardware isolation, it is time to upgrade.

Stop fighting `iowait`. Deploy your Zabbix server on a CoolVDS SSD instance today and sleep through the night.