If everything is urgent, nothing is urgent.
It’s 3:14 AM. Your phone buzzes. It’s Nagios again. "CPU Load High on db-node-04." You ssh in, run top, and see... nothing. The load average spiked for ten seconds because a backup script ran, and now it’s back to idle. You close your laptop, but your sleep cycle is ruined.
I have seen entire sysadmin teams burn out because of this. In 2015, with infrastructure scaling horizontally via tools like Ansible and Puppet, the old way of monitoring—checking if a ping returns—is dead. If you are managing twenty, fifty, or a hundred servers, you need actionable intelligence, not noise.
The "Steal Time" Trap
Most VPS providers in Europe oversell their CPU cores. They pile too many tenants onto a single hypervisor. When your neighbor’s WordPress site gets hit by a botnet, your database slows down. This shows up in your monitoring as %st (steal time).
I recently audited a Magento cluster for a client in Oslo. They were plagued by random timeouts. Their previous host blamed the PHP code. I ran one command:
iostat -c 1 5
The steal time was hovering around 15%. Their virtual CPU was waiting for the physical CPU to become available. You cannot tune your way out of noisy neighbors. This is why at CoolVDS, we rely strictly on KVM (Kernel-based Virtual Machine) with strict resource limits. If you buy 4 cores, you get 4 cores. No magic, no overcommit, just raw compute.
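If you want steal time in a script rather than eyeballing iostat, the same number can be derived from two readings of the aggregate cpu line in /proc/stat. What follows is a minimal sketch, not our production check: the function names `steal_percent` and `sample_steal` are made up for illustration, and it assumes a Linux kernel that exposes the steal counter as the eighth value on that line.

```python
import time


def steal_percent(before: str, after: str) -> float:
    """Return the share of CPU time stolen between two aggregate 'cpu' lines.

    Assumed field layout after the 'cpu' label:
    user nice system idle iowait irq softirq steal [guest guest_nice]
    """
    def counters(line):
        # Keep the first eight counters; guest time is already part of user.
        return [int(value) for value in line.split()[1:9]]

    deltas = [b - a for a, b in zip(counters(before), counters(after))]
    total = sum(deltas)
    if total <= 0:
        return 0.0  # no ticks elapsed between the two samples
    return 100.0 * deltas[7] / total


def sample_steal(interval: float = 1.0, path: str = "/proc/stat") -> float:
    """Take two live readings `interval` seconds apart (Linux only)."""
    with open(path) as handle:
        first = handle.readline()
    time.sleep(interval)
    with open(path) as handle:
        second = handle.readline()
    return steal_percent(first, second)
```

Calling `sample_steal(5)` on a guest gives one figure averaged over five seconds, which is easier to feed into a monitoring item than raw iostat output.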
Tuning Zabbix for Reality, Not Theory
We use Zabbix 2.4 extensively. It is powerful, but out of the box, it is too sensitive. Here is how we tune triggers to respect the reality of the Norwegian internet infrastructure.
1. Stop Alerting on Spikes
A CPU spiking to 90% for 30 seconds is not an incident; it's a computer doing its job. Alert only if the condition persists.
Bad Trigger: {host:system.cpu.load[percpu,avg1].last()}>5
Good Trigger: {host:system.cpu.load[percpu,avg1].min(5m)}>5
Because min(5m) returns the lowest value recorded in the last 5 minutes, the trigger fires only if every sample in that window was above 5. A short spike never wakes you up.
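The difference between the two triggers is easy to see in a few lines of code. This is a rough sketch of the idea only, not how Zabbix evaluates expressions internally: the class name `SustainedLoadTrigger` is hypothetical, and it uses a fixed number of samples where Zabbix works with timestamps.

```python
from collections import deque


class SustainedLoadTrigger:
    """Compare a 'last value' check with a 'minimum over window' check."""

    def __init__(self, threshold: float, window_size: int):
        self.threshold = threshold
        # Only the most recent `window_size` samples are kept.
        self.samples = deque(maxlen=window_size)

    def update(self, value: float):
        """Record one sample and return (naive_alert, sustained_alert)."""
        self.samples.append(value)
        naive = value > self.threshold
        window_full = len(self.samples) == self.samples.maxlen
        # Sustained: every sample in a full window is above the threshold.
        sustained = window_full and min(self.samples) > self.threshold
        return naive, sustained
```

With one sample per minute and a window of five, a lone reading of 9 trips the naive check and nothing else. Five consecutive readings above 5 are needed before the sustained check fires.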
2. Monitor I/O Latency, Not Just Usage
Disk space alerts are boring. Disk latency is where the fire starts. If you are running MySQL or PostgreSQL, I/O wait will kill your application long before the CPU maxes out. We use a custom UserParameter to track read/write operations per second.
Pro Tip: If your disk wait time consistently exceeds 20ms, you are on the wrong hardware. Standard SSDs are good, but for high-transaction databases, we are seeing incredible results with the new NVMe storage tiers available in our Oslo datacenter. The IOPS difference is not linear; it is exponential.
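To check a disk against that 20ms figure, average wait per I/O can be computed from two snapshots of /proc/diskstats, which is roughly what iostat reports as await. Treat the following as a sketch under assumptions: `parse_diskstats` and `await_ms` are illustrative names, the device name is whatever your system uses, and the column positions follow the standard Linux diskstats layout. It is not the UserParameter we run.

```python
def parse_diskstats(text: str, device: str) -> dict:
    """Extract the I/O counters for one device from /proc/diskstats content."""
    for line in text.splitlines():
        parts = line.split()
        # Columns: major minor name, then at least eleven counters.
        if len(parts) >= 14 and parts[2] == device:
            values = [int(v) for v in parts[3:14]]
            return {
                "reads": values[0],      # reads completed
                "read_ms": values[3],    # milliseconds spent reading
                "writes": values[4],     # writes completed
                "write_ms": values[7],   # milliseconds spent writing
            }
    raise ValueError("device not found: " + device)


def await_ms(before: dict, after: dict) -> float:
    """Average milliseconds per completed I/O between two snapshots."""
    ios = (after["reads"] - before["reads"]) + (after["writes"] - before["writes"])
    if ios <= 0:
        return 0.0  # disk was idle during the interval
    spent = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    return spent / ios
```

Read /proc/diskstats, wait a few seconds, read it again, and pass both parsed snapshots to `await_ms`. A result that stays above 20 is the signal described above.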
The Local Latency Advantage
Monitoring is also about external availability. If your target audience is in Norway, why are you pinging your servers from Texas? Network routes matter.
We peer directly at NIX (Norwegian Internet Exchange). When setting up your monitoring probes, place them geographically close to your users. A 30ms latency spike from a probe in Frankfurt might look like a server issue, but it could just be a congested route through Sweden. By hosting on CoolVDS in Norway, you eliminate the cross-border hops that often trigger false latency alerts.
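One way to keep a congested route from paging you is to compare several probes before deciding what is slow. The sketch below is an illustration of that reasoning, not a feature of any monitoring product: the function name `classify_latency`, the probe names, and the 20ms tolerance are all assumptions.

```python
def classify_latency(baseline_ms: dict, current_ms: dict, tolerance_ms: float = 20.0) -> str:
    """Guess whether a latency rise comes from the server or from a route.

    baseline_ms / current_ms map probe name -> round-trip time in ms.
    """
    slow = {
        probe
        for probe, value in current_ms.items()
        if value - baseline_ms[probe] > tolerance_ms
    }
    if not slow:
        return "ok"
    if slow == set(current_ms):
        # Every vantage point sees the slowdown: suspect the server itself.
        return "server"
    # Only some probes are affected: more likely a path problem.
    return "route"
```

With a single probe there is nothing to compare against, so any slowdown is reported as "server". That is exactly the false alert a second, nearby probe helps you avoid.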
Log Aggregation: The Next Step
Once you have metrics handled, you need logs. Grepping through /var/log/syslog across ten servers is impossible. We are currently rolling out the ELK Stack (Elasticsearch, Logstash, Kibana) for our internal systems. Piping your Nginx logs to Logstash allows you to visualize 500 errors in real time.
However, ELK is heavy (Java loves RAM). Do not run it on the same web server hosting your application. Deploy a dedicated instance. A 4GB RAM VPS is usually the minimum entry point for a stable Logstash indexer.
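Before committing RAM to a full ELK deployment, you can get a feel for the data with a few lines of plain Python. This is a stand-in for illustration, not Logstash and not our pipeline: the name `count_5xx_per_minute` is made up, and the pattern assumes Nginx is writing its default combined log format.

```python
import re
from collections import Counter

# Matches: [12/Mar/2015:03:14:07 +0100] "GET / HTTP/1.1" 500 ...
LOG_PATTERN = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) ')


def count_5xx_per_minute(lines) -> Counter:
    """Count server-error responses, bucketed by minute."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("status").startswith("5"):
            # Keep 'dd/Mon/yyyy:HH:MM' and drop seconds and timezone.
            minute = match.group("ts")[:17]
            counts[minute] += 1
    return counts
```

Pass it an open file handle on an access log and print `most_common(10)` to see the worst minutes. Once that stops being enough, that is when the dedicated Logstash instance earns its keep.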
Conclusion
Monitoring is not about collecting data; it is about filtering it. You need a baseline you can trust. That starts with hardware that doesn't fluctuate based on what other customers are doing.
Stop fighting false positives. Migrate your critical monitoring nodes and production workloads to an environment that respects your need for stability.
Need stable I/O? Deploy a KVM instance on CoolVDS today and see what 0% steal time looks like.