Is Your Server Plotting to Kill Your Sleep Schedule?
It is 3:17 AM on a Tuesday. Your phone is buzzing on the nightstand. You know exactly what it is before you even look. The database is down.
Again.
If you are managing servers for clients here in Norway or across Europe, you cannot rely on "hoping it stays up." Hope is not a strategy. As a sysadmin who has spent too many nights debugging MySQL crashes in cold server rooms, I can tell you that the only way to survive is proactive monitoring. We need to know the server is sick before it dies.
Today, we present the "Gold Standard" of open-source monitoring in 2011: Nagios for immediate alerting and Munin for historical trending.
The Dynamic Duo: Why You Need Both
Many admins make the mistake of choosing just one. This is wrong. The two tools solve different problems.
- Nagios is your watchdog. It asks binary questions: Is the web server up? Is disk space under 90%? If the answer is no, it screams at you.
- Munin is your historian. It graphs data over time. It tells you: "Your memory usage has been creeping up by 2% every day for the last month."
You need Munin to explain why Nagios woke you up.
Step 1: The Nagios Watchdog
Installing Nagios 3 on a Debian Squeeze (6.0) or CentOS 5 system is straightforward, but the configuration is where people get lazy. Don't just ping the server. A server responding to ICMP ping can still have a stuck Apache process.
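To make the "don't just ping" point concrete, here is a minimal sketch of a check that asks the web server for a page and judges the answer. It is not the stock check_http plugin: the function names `http_state` and `check_url` are hypothetical, and it assumes curl is installed. What it does borrow is the standard Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch only: http_state and check_url are hypothetical helpers,
# not part of Nagios or nagios-plugins.
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

http_state() {
    # $1 is the HTTP status code. curl reports "000" when it could not
    # connect or the request timed out.
    case "$1" in
        2??|3??) echo "OK";       return 0 ;;
        4??)     echo "WARNING";  return 1 ;;
        *)       echo "CRITICAL"; return 2 ;;
    esac
}

check_url() {
    # --max-time turns a hung Apache into a failed check instead of a
    # check that never returns.
    code=$(curl --silent --output /dev/null --max-time 10 \
                --write-out '%{http_code}' "$1")
    http_state "$code"
}
```

Calling `check_url http://web01.coolvds.no/` would print the state and return the matching exit code. A box that still answers ping but has a wedged Apache fails this check after ten seconds, which is exactly the case ping misses.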
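Here is a snippet for checking a service in /etc/nagios3/conf.d/services.cfg (the Debian path; CentOS packages typically keep their configuration under /etc/nagios/). The check_ssh plugin connects and reads the SSH banner, so it verifies that the daemon actually answers, not just that the port is open: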
define service {
    host_name               web01.coolvds.no
    service_description     SSH
    check_command           check_ssh
    use                     generic-service
    notification_interval   0 ; only notify once
}
Pro Tip: Set your notification_interval to 0 for non-critical warnings. You do not need an email every 30 minutes telling you the disk is 85% full. You need one email. Fix it, or acknowledge it.
Step 2: Visualizing the Rot with Munin
Munin is essentially a wrapper for RRDTool. It is ugly, but it is honest. When you deploy a VPS, especially for resource-heavy applications like Magento, you need to see I/O wait times.
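Munin's stock cpu plugin already graphs iowait for you, so you do not need to write this yourself. Still, it helps to know where the number comes from. The sketch below derives an I/O wait percentage from two samples of the `cpu` line in /proc/stat. The function name `iowait_pct` is hypothetical, and the field layout (iowait as the fifth counter) is the standard Linux one.

```shell
#!/bin/sh
# Sketch only: iowait_pct is a hypothetical helper, not a Munin plugin.
# Usage on a live system:
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"

iowait_pct() {
    # $1 and $2 are two "cpu ..." lines from /proc/stat, oldest first.
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Sum user..steal only; guest time is already counted in user.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6      # $6 = iowait, the 5th counter after the label
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print 0; exit }
            # Share of elapsed CPU time spent waiting on I/O, truncated.
            printf "%d\n", 100 * (w[2] - w[1]) / dt
        }'
}
```

The counters in /proc/stat only ever grow, so a single reading tells you nothing. It is the difference between two readings that matters, which is the same reason Munin stores these values as rates in RRD files.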
To enable the MySQL plugins on your node:
ln -s /usr/share/munin/plugins/mysql_queries /etc/munin/plugins/
ln -s /usr/share/munin/plugins/mysql_slowqueries /etc/munin/plugins/
/etc/init.d/munin-node restart
If you see a spike in "slow queries" on the graph at the exact same time Nagios reported high load, you have found your culprit. No guessing required.
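If you are curious what those symlinked plugins actually do, the protocol is simple: called with the argument `config`, a plugin describes its graph; called with no argument, it prints the current values. Here is a minimal sketch of a custom plugin under that protocol. The names `inode_pct` and `munin_inode_plugin` and the field name `used` are hypothetical, the warning level is an arbitrary example, and Munin already ships its own inode plugin, so treat this as an illustration rather than something to deploy.

```shell
#!/bin/sh
# Sketch only: a hypothetical Munin-style plugin for inode usage on /.

inode_pct() {
    # Reads `df -Pi` output on stdin and prints the IUse% column of the
    # last data line, without the percent sign.
    awk 'NR > 1 { sub(/%/, "", $5); pct = $5 } END { print pct }'
}

munin_inode_plugin() {
    if [ "$1" = "config" ]; then
        # Graph description, requested by munin-node with "config".
        echo "graph_title Inode usage on /"
        echo "graph_vlabel percent"
        echo "graph_category disk"
        echo "used.label inodes used"
        echo "used.warning 85"
        return 0
    fi
    # Normal run: one "<field>.value <number>" line per data series.
    echo "used.value $(df -Pi / | inode_pct)"
}
```

A real plugin would be a standalone script that ends with a call such as `munin_inode_plugin "$1"` and is linked into /etc/munin/plugins/ like the MySQL plugins above.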
The Hardware Factor: Not All VPSs Are Equal
Monitoring software can only do so much if the underlying hardware is choking. In the hosting market right now, there is a lot of noise about "cloud," but physics still applies.
If other tenants on the same host saturate the disks, your I/O wait climbs and your load average skyrockets even while your CPU sits idle. This is the "noisy neighbor" effect common in cheap OpenVZ containers.
At CoolVDS, we utilize KVM virtualization. This provides true hardware isolation. When Munin reports 50% CPU usage on a CoolVDS instance, you are actually using 50% of the core, not waiting for another customer's PHP script to finish execution. Stability requires predictable resources.
Network Latency and Geography
For Norwegian businesses, the physical location of the monitoring server matters. If your Nagios instance is in Texas and your server is in Oslo, you will get false positives every time a transatlantic fiber line hiccups.
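Distance is one cause of false positives; alerting on a single failed probe is another. Nagios deals with the second through its `max_check_attempts` directive, which only raises a hard alert after several failed checks in a row. The sketch below shows that idea in isolation. The function names `trailing_failures` and `should_alert` are hypothetical and this is not how Nagios implements it internally.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers illustrating "alert after N
# consecutive failures" rather than on the first hiccup.

trailing_failures() {
    # Arguments are check results in chronological order:
    # 0 = success, anything else = failure.
    n=0
    for r in "$@"; do
        if [ "$r" -eq 0 ]; then
            n=0                 # a success resets the streak
        else
            n=$((n + 1))
        fi
    done
    echo "$n"
}

should_alert() {
    # $1 = consecutive failures required; remaining args = result history.
    need=$1
    shift
    [ "$(trailing_failures "$@")" -ge "$need" ]
}
```

With a threshold of 3, a history of `0 2 2` stays quiet while `2 2 2` alerts. One dropped packet across the Atlantic no longer wakes you up, but a server that is really down still does within a few check intervals.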
Hosting your monitoring infrastructure locally, or with a provider that peers directly at NIX (Norwegian Internet Exchange), reduces false alarms. It also helps you stay compliant with the Personal Data Act (Personopplysningsloven) and Datatilsynet guidelines regarding where log data containing IP addresses is stored.
Configuration Checklist for Production
Before you sign off on a deployment, ensure these checks are active:
| Check Type | Tool | Threshold |
|---|---|---|
| Disk Space (Root) | Nagios | Warn at 85%, Critical at 95% |
| RAID Status | Nagios (nrpe) | Critical on any degradation |
| Inode Usage | Munin | Graph usage (a stuck mail queue can exhaust inodes) |
| Apache/Nginx Connections | Munin | Monitor for sudden traffic spikes (DDoS) |
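The disk thresholds in the table map directly onto Nagios states. In production you would normally use the stock check_disk plugin (its `-w` and `-c` options take free space, so 85% used corresponds to `-w 15%`). The sketch below only illustrates the logic. The function name `disk_state` is hypothetical.

```shell
#!/bin/sh
# Sketch only: disk_state is a hypothetical helper, not the stock
# check_disk plugin. Exit codes follow the Nagios plugin convention:
# 0 = OK, 1 = WARNING, 2 = CRITICAL.

disk_state() {
    # $1 = percent used (integer), $2 = warning level, $3 = critical level
    if [ "$1" -ge "$3" ]; then
        echo "DISK CRITICAL - ${1}% used"
        return 2
    elif [ "$1" -ge "$2" ]; then
        echo "DISK WARNING - ${1}% used"
        return 1
    fi
    echo "DISK OK - ${1}% used"
    return 0
}

# Example call for the root filesystem with the thresholds from the table:
#   used=$(df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
#   disk_state "$used" 85 95
```

The critical level is tested first on purpose: a disk at 96% satisfies both conditions, and you want the more severe state to win.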
Don't Wait for the Crash
Setting up proper monitoring takes about an hour. Rebuilding a corrupted database after a disk fills up takes all day. The math is simple.
If you are looking for a platform that respects your need for stability, high-performance SSD storage (still a rarity in 2011!), and low latency connectivity within Scandinavia, give our infrastructure a look. We don't oversell, and we don't interfere with your kernel.
Ready to secure your uptime? Spin up a CoolVDS KVM instance and install Nagios today.