The Silence of the Logs
It’s 3:00 AM. Your phone buzzes. It’s not a text from a friend; it’s PagerDuty, or worse, an angry client calling because their Magento storefront is throwing 502 Bad Gateway errors. If you are running infrastructure in 2012 without comprehensive monitoring, you are not a Systems Administrator—you are a firefighter with a blindfold.
Most VPS providers in the current market sell you "guaranteed RAM" and then stuff 50 users onto a single physical node using OpenVZ. When one neighbor compiles a kernel, your I/O wait spikes, and your site crawls. You can't fix what you can't measure.
In this guide, we aren't just installing packages. We are building a telemetry deck. We will use Nagios for immediate alerting (the "Heartbeat") and Munin for retrospective analysis (the "Brain"). We will deploy this on a standard CentOS 6.3 environment, utilizing the kind of dedicated resource stability you only find on platforms like CoolVDS.
The Philosophy: Binary State vs. Analog Trends
Many junior admins confuse monitoring with alerting. They are distinct disciplines.
- Nagios answers the question: "Is the service alive?" It is binary. Red or Green. It wakes you up.
- Munin answers the question: "Why did the service die?" It is analog. It draws graphs over days, weeks, and months. It lets you sleep.
I recently audited a high-traffic news portal hosted here in Oslo. They suffered random downtime every Tuesday at 04:00. Nagios just said "CRITICAL: Socket Timeout." Useless for debugging. It was Munin that revealed a massive spike in MySQL InnoDB buffer pool usage coinciding exactly with a misconfigured cron backup script that was locking tables. Without graphs, we would still be blaming the firewall.
Part 1: The Watchdog (Nagios Core 3.4)
Nagios is the industry standard for a reason. It is ugly, config-heavy, and absolutely bulletproof. We aren't using the packaged RPMs because we need control over the compilation flags to keep the footprint small on our monitoring node.
Prerequisites & Compilation
First, ensure you have the build essentials. On a clean CoolVDS instance (I recommend the 1GB RAM plan for a monitoring node), run:
yum install -y httpd php gcc glibc glibc-common gd gd-devel make net-snmp
Download Nagios Core 3.4.1 (the stable release as of June). Don't use the bleeding edge unless you like fixing segmentation faults.
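# Create the nagios user and a group for external commands from the web UI
useradd nagios
groupadd nagcmd
usermod -a -G nagcmd nagios
usermod -a -G nagcmd apache
# Unpack and build from source
tar zxvf nagios-3.4.1.tar.gz
cd nagios
./configure --with-command-group=nagcmd
make all
# Install binaries, init script, sample configs and command-file permissions
make install
make install-init
make install-config
make install-commandmode
# Install the Apache configuration for the web interface
make install-webconf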
Configuring the 'Check' Logic
The magic happens in objects/commands.cfg (under /usr/local/nagios/etc/ with the default install prefix). A common mistake is relying on the default ping check, which fails whenever ICMP is blocked by over-aggressive iptables rules at the datacenter level. We prefer checking the SSH port or the HTTP response directly.
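Here is a command definition for checking web latency. It doesn't just test whether port 80 is open; it also grades the response time: -w 2 raises a WARNING above 2 seconds, -c 5 raises a CRITICAL above 5 seconds, and -t 10 gives up after 10 seconds. Note that check_http ships with the separate Nagios Plugins package, which must be installed for $USER1$/check_http to exist: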
define command{
    command_name    check_http_latency
    command_line    $USER1$/check_http -H $HOSTADDRESS$ -w 2 -c 5 -t 10
}
Pro Tip: Set your timeout (-t) higher if you are monitoring servers across the Atlantic. However, if your target is a CoolVDS instance in our European datacenter, latency should be under 20ms. A 10-second timeout is generous; if it takes that long, it's broken.
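To make the -w and -c thresholds concrete, here is a minimal sketch of how a response time maps onto the exit codes Nagios expects from any plugin (0 for OK, 1 for WARNING, 2 for CRITICAL). The classify_latency function is a hypothetical helper for illustration, not part of the Nagios Plugins package, and treating a value that sits exactly on a threshold as still within bounds is an assumption.

```shell
#!/bin/bash
# classify_latency ELAPSED WARN CRIT
# Hypothetical helper: prints a Nagios state name and returns the matching
# plugin exit code (0 = OK, 1 = WARNING, 2 = CRITICAL).
classify_latency() {
    local elapsed="$1" warn="$2" crit="$3"
    # awk does the floating-point comparison that bash cannot do natively;
    # its exit status becomes the return code of this function.
    awk -v t="$elapsed" -v w="$warn" -v c="$crit" 'BEGIN {
        if (t > c) {
            print "CRITICAL"; exit 2
        } else if (t > w) {
            print "WARNING"; exit 1
        }
        print "OK"; exit 0
    }'
}
```

With the thresholds from the command definition above, classify_latency 0.3 2 5 reports OK, while a 3.1-second response is a WARNING and a 7-second response is CRITICAL.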
Part 2: The Historian (Munin)
Munin uses a master/node architecture: the master polls each node every 5 minutes via cron. Munin 2.0, released a few months ago, finally brings native support for asynchronous updates, but on CentOS 6 I still stick with the 1.4 branch for production-critical systems until 2.0 matures.
Installing the Munin Node
On the server you want to monitor (the client), install the node. This requires the EPEL repository.
rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm
yum install -y munin-node
# Enable at boot and start the daemon now
chkconfig munin-node on
service munin-node start
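Installing the package is only half of the master/node handshake: the node has to allow the master's IP address in /etc/munin/munin-node.conf, and the master needs a host entry in /etc/munin/munin.conf. The sketch below generates both fragments. The function names, the hostname web01.example.com and the 192.0.2.x addresses are placeholders for illustration, not values from this setup.

```shell
#!/bin/bash
# munin_allow_line MASTER_IP
# Hypothetical helper: prints the "allow" directive for munin-node.conf.
# The directive takes a regular expression, so the dots must be escaped.
munin_allow_line() {
    local escaped
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf 'allow ^%s$\n' "$escaped"
}

# munin_master_stanza NODE_NAME NODE_IP
# Hypothetical helper: prints the host entry for munin.conf on the master.
munin_master_stanza() {
    printf '[%s]\n    address %s\n    use_node_name yes\n' "$1" "$2"
}

# Example (run on the node and on the master respectively):
#   munin_allow_line 192.0.2.10 >> /etc/munin/munin-node.conf
#   munin_master_stanza web01.example.com 192.0.2.20 >> /etc/munin/munin.conf
```

Restart munin-node after changing the allow list; the master picks up the new host on its next cron run.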
Plugin Configuration: The Secret Sauce
Munin's power lies in its plugins. The default MySQL plugins are okay, but they often fail to capture the nuances of InnoDB. You need to edit /etc/munin/plugin-conf.d/munin-node to pass the correct credentials.
[mysql*]
env.mysqlopts -u munin --password=SecretPassword123
env.mysqladmin /usr/bin/mysqladmin
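Do not let Munin connect to MySQL as root. Create a dedicated user that holds only the global PROCESS and SUPER privileges and has no rights on your application databases. SUPER is itself a broad privilege, so restrict the account to localhost. Security matters.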
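A minimal sketch of creating that account follows. The munin_grant_sql function is a hypothetical helper that only prints the statement, so you can review it before piping it to the mysql client; the localhost-only host part is an assumption that the node and the database run on the same machine.

```shell
#!/bin/bash
# munin_grant_sql USER PASSWORD
# Hypothetical helper: prints a GRANT statement for a monitoring-only account.
# The password is inserted verbatim, so it must not contain a single quote.
munin_grant_sql() {
    local user="$1" pass="$2"
    printf "GRANT PROCESS, SUPER ON *.* TO '%s'@'localhost' IDENTIFIED BY '%s';\n" \
        "$user" "$pass"
}

# Example: review the output first, then apply it as an administrative user:
#   munin_grant_sql munin 'SecretPassword123' | mysql -u root -p
```

The user name and password must match the env.mysqlopts line in /etc/munin/plugin-conf.d/munin-node.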
Identifying "Noisy Neighbors" via Steal Time
This is where your choice of hosting provider becomes critical. In virtualized environments, "Steal Time" (%st in top) is the percentage of time your virtual CPU spends waiting for a physical CPU while the hypervisor services another virtual processor.
If you see high Steal Time on your graphs, your provider is overselling. This is common with budget OpenVZ hosts. At CoolVDS, we utilize KVM (Kernel-based Virtual Machine) with strict resource isolation. We don't oversell cores. When you look at a Munin CPU graph on our platform, 100% CPU means you are using it, not someone else.
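If you want a quick number before the Munin graphs have accumulated history, steal time can be derived from two samples of the aggregate "cpu" line in /proc/stat. The sketch below assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); steal_percent is a hypothetical helper name.

```shell
#!/bin/bash
# steal_percent "<cpu line at t0>" "<cpu line at t1>"
# Hypothetical helper: prints the share of CPU time stolen between two
# samples of the aggregate "cpu" line from /proc/stat, as a percentage.
steal_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2..9 are user, nice, system, idle, iowait, irq,
            # softirq and steal. Guest time is already counted in user.
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            steal = $9
        }
        NR == 1 { t0 = total; s0 = steal }
        NR == 2 {
            dt = total - t0
            if (dt > 0) {
                printf "%.1f\n", 100 * (steal - s0) / dt
            } else {
                print "0.0"
            }
        }
    '
}

# Example on a live system (5-second window):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   steal_percent "$a" "$b"
```

A single reading proves little; it is the trend on the Munin CPU graph over days that shows whether a neighbor is regularly eating your cycles.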
Datacenter Note: When hosting in Norway or utilizing the NIX (Norwegian Internet Exchange), keep an eye on your latency graphs during peak streaming hours (usually 19:00 - 22:00 CET). If you see jitter, it’s network congestion. We peer directly at NIX to mitigate this, ensuring your packets don't take a detour through Sweden just to get to Oslo.
Integrating Custom Scripts
Sometimes you need to monitor something specific, like the queue size of a Postfix mail server or the number of files in a particular directory. Here is a simple Bash plugin for Munin that counts the entries in a directory, which is useful for spotting cache build-ups.
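#!/bin/bash
# /etc/munin/plugins/file_count
# Munin plugin: reports the number of entries in a cache directory.
DIR="/var/www/html/var/cache"

# "config" is requested by the master to learn how to draw the graph.
if [ "$1" = "config" ]; then
    echo "graph_title Cache Directory Count"
    echo "graph_vlabel files"
    echo "graph_category disk"
    echo "count.label files"
    echo "count.warning 5000"
    echo "count.critical 10000"
    exit 0
fi

# Default call: print the current value. $DIR is quoted so paths with spaces work.
ls -1 "$DIR" | wc -l | awk '{print "count.value " $1}'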
Don't forget to make it executable and restart the node:
chmod +x /etc/munin/plugins/file_count
service munin-node restart
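One refinement worth considering: take the directory from the plugin's environment instead of hard-coding it, and report an unknown value when the directory is missing so the graph shows a gap rather than a misleading zero. The sketch below is one way to do that; emit_count is a hypothetical helper, and the env.dir setting name is an assumption you would define yourself under a [file_count] section in /etc/munin/plugin-conf.d/munin-node.

```shell
#!/bin/bash
# emit_count DIR
# Hypothetical helper: prints a Munin value line for the number of entries
# directly inside DIR, or "U" (Munin's marker for unknown) if DIR is missing.
emit_count() {
    local dir="$1"
    if [ -d "$dir" ]; then
        # find avoids parsing ls output; -mindepth 1 excludes DIR itself.
        printf 'count.value %s\n' \
            "$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' ')"
    else
        echo "count.value U"
    fi
}

# In the plugin body, with "env.dir /some/path" set in plugin-conf.d:
#   emit_count "${dir:-/var/www/html/var/cache}"
```

You can check the result on the node with munin-run file_count before waiting for the next polling cycle.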
The Storage Bottleneck
In 2012, the single biggest bottleneck for web servers is disk I/O. Rotating rust (HDDs) simply cannot keep up with the random read/write patterns of modern CMSs like Magento or Drupal. You will see this in Munin under "Disk IOs per device." If you are consistently hitting the ceiling, no amount of caching will save you.
This is why we are aggressively moving towards SSD storage setups at CoolVDS. While traditional SAS 15k drives are reliable, the IOPS (Input/Output Operations Per Second) provided by Solid State Drives is a paradigm shift. If your `iowait` is consistently above 20%, it is time to migrate off legacy storage.
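What counts as "consistently" is a judgment call. As a rough sketch, the helper below reads a series of iowait percentages and succeeds only if every sample exceeds the threshold, so a single spike does not trigger a migration decision. iowait_sustained is a hypothetical name, and the vmstat column position in the usage comment is an assumption about the procps version on your system.

```shell
#!/bin/bash
# iowait_sustained THRESHOLD
# Hypothetical helper: reads one iowait percentage per line on stdin.
# Returns 0 if there is at least one sample and all samples exceed
# THRESHOLD, otherwise returns 1.
iowait_sustained() {
    awk -v limit="$1" '
        NF { n++; if ($1 + 0 > limit + 0) over++ }
        END {
            if (n > 0 && over == n) exit 0
            exit 1
        }
    '
}

# Example: one minute of 5-second samples. The first data row of vmstat is
# an average since boot, so it is skipped; "wa" is assumed to be column 16.
#   vmstat 5 13 | awk 'NR > 3 { print $16 }' | iowait_sustained 20 \
#       && echo "iowait stayed above 20% for the whole window"
```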
Conclusion: Compliance and Uptime
In Norway, the Personopplysningsloven (Personal Data Act) dictates strict requirements for data integrity. Unplanned downtime can be construed as a failure to secure data availability. By implementing Nagios for immediate alerts and Munin for long-term capacity planning, you demonstrate the due diligence that Datatilsynet (the Norwegian Data Protection Authority) expects.
Monitoring is not a "set it and forget it" task. It is a culture. But that culture requires a stable foundation. You cannot monitor your way out of bad hardware or a congested network.
Ready to see flat graphs and zero steal time? Stop fighting with noisy neighbors. Deploy a KVM instance on CoolVDS today and experience the difference that dedicated resources and low-latency peering can make for your infrastructure.