Scaling Infrastructure Monitoring in 2014: Surviving the Transition from Nagios to Distributed Zabbix
There is a specific kind of fatigue that only a sysadmin knows. It's the 3:00 AM buzz of a pager alerting you that web-node-04 is down, only for you to SSH in and find the server humming along perfectly. The culprit? Momentary packet loss at the switch or, worse, a monitoring server so overloaded writing RRD files that it timed out its own checks.
If you are managing more than 50 servers, the "classic" single-node Nagios setup starts to show cracks. The check latency creeps up. The disk I/O wait creates gaps in your graphs. You aren't monitoring anymore; you're just hoping.
At CoolVDS, we see this constantly with clients migrating from shared hosting to our dedicated KVM slices. They bring their old monitoring habits with them. Today, we are going to architect a monitoring solution that scales, utilizing Zabbix 2.2 LTS (released late last year) and identifying the hardware bottlenecks that usually kill monitoring performance.
The Hidden Killer: Disk I/O and Database Locks
Most people assume monitoring is CPU intensive. It's not. It is an I/O punisher. Every single metric you collect—CPU load, free RAM, network traffic—requires a database write. If you are polling 200 servers with 50 items each every 30 seconds, that is roughly 333 new values per second, or about 20,000 inserts per minute, hammering your database around the clock.
On traditional spinning HDD VPS providers, this is where the system dies. The iowait spikes, the MySQL process locks up, and Zabbix starts dropping data.
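Before blaming Zabbix, confirm the storage is actually the bottleneck. A quick check from the shell (both tools ship in most distribution repositories; iostat is part of the sysstat package):

# Extended device stats every 5 seconds; sustained high %util and await on the
# database volume means the disk, not Zabbix, is the problem
iostat -x 5
# Show only processes currently doing I/O to identify the writer
iotop -o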
Pro Tip: Never run your monitoring stack on OpenVZ or container-based virtualization if you can avoid it. You have no guarantee of I/O isolation. We enforce KVM virtualization on CoolVDS specifically so your heavy database writes don't get throttled by a neighbor's backup script.
Step 1: Tuning the Zabbix Server
Out of the box, Zabbix is configured for a hobbyist, not a high-traffic environment. The default number of poller processes is nowhere near enough for reliable checks across the NIX (Norwegian Internet Exchange) or broader European networks.
Here is the /etc/zabbix/zabbix_server.conf configuration we use as a baseline for mid-sized infrastructure:
### Advanced Performance Tuning
StartPollers=80
StartIPMIPollers=5
StartPollersUnreachable=40
StartTrappers=20
StartPingers=20
# Cache Sizes - Critical for performance
CacheSize=128M
HistoryCacheSize=64M
TrendCacheSize=32M
HistoryTextCacheSize=64M
# Database housekeeping
HousekeepingFrequency=1
MaxHousekeeperDelete=5000
Increasing the CacheSize is mandatory. If Zabbix cannot store configuration data in RAM, it hits the disk for every check. On standard SATA storage, this is fatal. On our Pure SSD instances, you have more headroom, but caching remains best practice.
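A reasonable sanity check after raising these values is to let Zabbix monitor itself. The internal item keys below (standard internal checks; double-check them against the Zabbix 2.2 documentation) report the free percentage of each cache and should never trend toward zero:

zabbix[rcache,buffer,pfree]    # configuration cache, percent free
zabbix[wcache,history,pfree]   # history write cache, percent free
zabbix[wcache,trend,pfree]     # trend write cache, percent free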
Step 2: Monitoring the Monitor (MySQL Tuning)
Zabbix is essentially a frontend for a MySQL database. If MySQL is slow, Zabbix is broken. With the release of MySQL 5.6, we have better performance defaults, but you still need to optimize InnoDB for write-heavy workloads.
Ensure your my.cnf is set to utilize your available RAM:
[mysqld]
# Set to 70-80% of total RAM on a dedicated monitoring node
innodb_buffer_pool_size = 4G
innodb_buffer_pool_instances = 4
# SSD Optimization
innodb_flush_neighbors = 0
innodb_io_capacity = 2000
The innodb_flush_neighbors = 0 setting is crucial for SSDs. It stops InnoDB from flushing neighboring dirty pages alongside the one it actually needs to write, an optimization that only pays off on spinning disks where grouping writes saves seeks. SSDs have no seek penalty, so the extra flushing is pure overhead.
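Once the server has run under monitoring load for a day or two, verify the buffer pool is actually absorbing the reads. These are standard InnoDB status counters; a large ratio of read requests to physical reads means the pool is doing its job:

-- Logical read requests served from the buffer pool
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
-- Reads that had to hit the disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';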
Step 3: Custom Metrics with UserParameters
Ping checks tell you if a server is up. They don't tell you if it's working. To monitor application health, you need custom agents. Let's say you are running Nginx on a CoolVDS instance serving a Norwegian e-commerce site. You need to know the active connections.
First, enable the stub status module by adding a location block inside the server {} block in /etc/nginx/sites-available/default:
location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
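Reload nginx and sanity-check the endpoint locally before wiring anything into Zabbix (the service command assumes a Debian/Ubuntu-style layout). You should see the Active connections counter plus the Reading/Writing/Waiting line:

nginx -t && service nginx reload
wget -O- -q http://127.0.0.1/nginx_status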
Next, define UserParameters in your zabbix_agentd.conf to scrape this data:
UserParameter=nginx.active[*],wget -O- -q http://127.0.0.1/nginx_status | awk '/Active/ {print $NF}'
UserParameter=nginx.reading[*],wget -O- -q http://127.0.0.1/nginx_status | awk '/Reading/ {print $2}'
UserParameter=nginx.writing[*],wget -O- -q http://127.0.0.1/nginx_status | awk '/Writing/ {print $4}'
This allows you to graph connection spikes in real-time, correlating them with marketing campaigns or DDoS attempts.
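Before creating the items in the frontend, verify the keys actually resolve. A rough sketch follows (the IP address is a placeholder for your monitored node; if the agent reports the plain key as unsupported, query it with empty brackets, e.g. nginx.active[], to match the flexible definition):

# On the monitored node: restart the agent and test the key against its own config
service zabbix-agent restart
zabbix_agentd -t nginx.active
# From the Zabbix server: confirm it over the network
zabbix_get -s 10.0.0.5 -k nginx.active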
The Connectivity Factor: Latency and False Positives
In Norway, network topology matters. If your monitoring server is hosted in Germany but your clients are in Oslo, a hiccup in the peering at Hamburg can trigger mass alerts. This is "alert fatigue," and it leads to sysadmins ignoring real problems.
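One cheap defense is to stop triggers from firing on a single missed check. A common pattern, sketched here against the web-node-04 example from earlier, is to require several consecutive failures of the ICMP ping item before alerting:

# Fires only if the last three icmpping values are all 0, i.e. roughly three
# polling intervals of genuine unreachability rather than one dropped packet
{web-node-04:icmpping.max(#3)}=0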
We recommend placing your monitoring infrastructure as close to your production gear as possible. Keeping monitoring and log data within national borders also simplifies compliance with the Personal Data Act (Personopplysningsloven) and guidance from Datatilsynet, a strategy many CTOs are adopting to avoid legal headaches even before the stricter EU data rules arrive.
Comparison: Monitoring Storage Backends
| Storage Type | Random Write IOPS | Suitability for Zabbix/Graphite |
|---|---|---|
| 7.2k RPM SATA | ~80-100 | Poor. Only for very small setups (<10 hosts). |
| 15k RPM SAS | ~180-200 | Moderate. Decent for mid-size, but expensive. |
| CoolVDS SSD | 50,000+ | Excellent. Handles massive insert rates without lag. |
Automating the Agent Deployment
Finally, never install agents manually. It is 2014; we have tools for this. Whether you prefer Puppet, Chef, or CFEngine, automation is key to consistency.
Here is a simple Puppet snippet to ensure the Zabbix agent is installed, running, and restarted whenever its configuration changes:
package { 'zabbix-agent':
  ensure => installed,
}

service { 'zabbix-agent':
  ensure     => running,
  enable     => true,
  hasrestart => true,
  require    => Package['zabbix-agent'],
  subscribe  => File['/etc/zabbix/zabbix_agentd.conf'],
}

file { '/etc/zabbix/zabbix_agentd.conf':
  ensure  => present,
  content => template('zabbix/zabbix_agentd.conf.erb'),
  require => Package['zabbix-agent'],
}
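The template referenced above is where the consistency pay-off lives. A minimal zabbix_agentd.conf.erb might look like the sketch below, assuming you expose the server address as a variable named zabbix_server (adapt the names to your own module layout):

Server=<%= @zabbix_server %>
ServerActive=<%= @zabbix_server %>
Hostname=<%= @fqdn %>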
Conclusion
Monitoring is not just about installing software; it is about architecture. You need the right configuration, the right automation, and fundamentally, the right hardware. A monitoring system that lags is worse than no monitoring system at all.
If you are tired of I/O wait stealing your sleep, it is time to upgrade your foundation. CoolVDS offers high-performance SSD VPS instances in Norway with the low latency and I/O throughput required for serious DevOps work.
Ready to stabilize your infrastructure? Deploy a CoolVDS SSD instance today and stop guessing about your uptime.