The 3:00 AM Pager Duty Nightmare
It is a scenario every System Administrator in Oslo knows too well. It is 03:14 on a Tuesday. Your phone buzzes. The web server is down. You scramble to your laptop, SSH in, and verify that httpd is dead. You restart it. It works. You go back to sleep.
But you have no idea why it crashed. Was it a memory leak? A traffic spike from the US? A script kiddie running a DDoS? Without historical data and proactive alerting, you are just a glorified firefighter putting out flames with a water pistol.
In 2010, relying on users to tell you the site is down is professional negligence. We need a two-pronged approach: Nagios for the "Is it up?" alerts, and Munin for the "How is it performing?" graphs.
The Watchdog: Configuring Nagios 3
Nagios is the industry standard for a reason. It doesn't care about pretty graphs; it cares about discrete states: OK, WARNING, CRITICAL, or UNKNOWN. If you aren't running Nagios 3.2, you are flying blind.
The mistake most admins make is monitoring only PING. A server can respond to a ping while MySQL is deadlocked and Apache is serving 500 errors. You need to monitor the service, not just the metal.
Here is a command definition that checks whether a specific URL responds on a remote web server. Put this in your /usr/local/nagios/etc/objects/commands.cfg:
    define command{
        command_name    check_http_url
        command_line    $USER1$/check_http -H $HOSTADDRESS$ -u $ARG1$
    }
And alongside your host definition, add the service:
    define service{
        use                     generic-service
        host_name               web-node-01
        service_description     Magento Front Page
        check_command           check_http_url!/index.php
    }
This checks that the actual PHP code is executing, not just that the port is open.
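If you want to see what a check like this boils down to, here is a minimal Python sketch of the plugin convention: fetch the URL, then map the result to one of the four Nagios states through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the status thresholds are hypothetical illustrations, not the actual logic of check_http.

```python
import urllib.error
import urllib.request

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")


def classify_status(status):
    """Map an HTTP status code to a Nagios state (assumed thresholds)."""
    if 200 <= status < 400:
        return OK
    if 400 <= status < 500:
        return WARNING
    if 500 <= status < 600:
        return CRITICAL
    return UNKNOWN


def check_url(url, timeout=10):
    """Request url and return a (state, message) pair for Nagios."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.getcode()
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; keep the status code.
        status = err.code
    except OSError as err:
        # Connection refused, DNS failure, timeout and similar.
        return CRITICAL, "HTTP CRITICAL - %s" % err
    state = classify_status(status)
    return state, "HTTP %s - status %d" % (LABELS[state], status)
```

A wrapper script would print the message on one line and call sys.exit() with the state, which is all Nagios needs to decide whether to page you.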
The "CoolVDS" Reliability Factor
Pro Tip: False positives kill morale. If your VPS provider oversubscribes their CPU, Nagios checks will time out simply because the host node is lagging. At CoolVDS, we use strict Xen-based virtualization. This means your CPU cycles are reserved. If Nagios says the server is slow, it is actually your code, not our hypervisor choking on a neighbor's process.
The Historian: Trending with Munin
Nagios wakes you up; Munin tells you what happened. Munin uses RRDTool to graph system metrics over time. This is critical for capacity planning. If you see disk usage creeping up by 2% daily, you can buy more storage weeks before the crash occurs.
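The capacity-planning arithmetic is simple enough to sketch. The snippet below assumes linear growth and reads "2% daily" as two percentage points of total capacity per day; the function name is hypothetical, and real growth is rarely this tidy.

```python
def days_until_full(used_percent, growth_per_day):
    """Estimate days until the disk hits 100%, assuming linear growth.

    used_percent: current usage, in percent of total capacity.
    growth_per_day: growth in percentage points per day.
    Returns None when usage is flat or shrinking.
    """
    if growth_per_day <= 0:
        return None
    remaining = max(100.0 - used_percent, 0.0)
    return remaining / growth_per_day


# A disk at 60% that grows 2 points a day has 20 days left.
print(days_until_full(60.0, 2.0))  # 20.0
```

Read the slope off the Munin weekly graph, plug it in, and you know how long you have to order storage.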
Installing Munin on Ubuntu 10.04 LTS (Lucid Lynx) or CentOS 5.5 is straightforward, but the configuration requires security discipline. By default, the node listens on port 4949.
Open /etc/munin/munin-node.conf and ensure you restrict access. You do not want competitors seeing your traffic graphs.
    # /etc/munin/munin-node.conf
    allow ^127\.0\.0\.1$
    # Your central monitoring server IP
    allow ^10\.0\.0\.5$
Don't forget to restart the node:
    /etc/init.d/munin-node restart
Spotting the "Steal Time"
One specific graph in Munin separates amateur hosting from professional-grade hosting: CPU Steal Time. This metric measures the time your virtual CPU spends ready to run but waiting for the hypervisor to schedule it on a physical CPU.
On cheap, oversold OpenVZ containers, you will often see Steal Time spike during peak hours. This destroys database performance. Because CoolVDS focuses on low latency and guaranteed resources, our Steal Time graphs are flatlining at zero. You get the I/O and CPU you pay for.
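To check the number yourself on a Linux guest, you can compare two readings of the aggregate cpu line in /proc/stat, where steal is the eighth counter. The sketch below is a rough illustration with hypothetical function names and made-up sample values, not how Munin draws its graph.

```python
def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time reported as steal between two samples."""
    # Counter order: user nice system idle iowait irq softirq steal [guest ...]
    deltas = [b - a for a, b in zip(before, after)]
    if len(deltas) < 8:
        return 0.0  # kernel does not report steal
    # Sum only the first eight counters; guest time is already part of user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


# Made-up samples: 1000 ticks elapsed in total, 20 of them stolen.
first = parse_cpu_line("cpu  100 0 50 800 20 0 5 25")
second = parse_cpu_line("cpu  150 0 70 1700 30 0 5 45")
print(steal_percent(first, second))  # 2.0
```

In practice you would read the first line of /proc/stat twice, a few seconds apart, and feed both lines through parse_cpu_line.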
Data Privacy and Norwegian Law
Monitoring logs contain sensitive data. IP addresses and error logs with user input all fall under the Personal Data Act (Personopplysningsloven), which is enforced by the Data Inspectorate (Datatilsynet).
Hosting your monitoring server outside the EEA creates a legal headache regarding data export. By keeping your monitoring infrastructure on CoolVDS servers located in Oslo, you ensure that your system logs remain within Norwegian jurisdiction, adhering to Directive 95/46/EC without complex legal gymnastics.
Implementation Strategy
Don't try to monitor everything on day one. Start here:
- Deploy a central CoolVDS instance dedicated to monitoring. Don't run Nagios on the same server you are monitoring (if the server dies, who alerts you?).
- Install Nagios 3 and set up alerts for Load, Disk Space, and HTTP response.
- Install Munin Node on all production servers.
- Set up SMS alerts. Email is for logs; SMS is for emergencies.
The difference between a frantic 3 AM panic and a calm Tuesday morning is visibility. Don't guess. Measure.
Ready to stabilize your infrastructure? Deploy a monitoring instance on CoolVDS today. Our Xen-based VPS offers the stability your Nagios checks require to be accurate.