Sleep Through the Night: The Ultimate Guide to Munin and Nagios Monitoring on CentOS 5
There is nothing quite like the sound of an SMS alert at 3:00 AM. It’s the sound of failure. If you are waking up to fix a crashed MySQL table or a kernel panic, you aren't doing DevOps—you are doing damage control. The difference between a junior admin and a senior architect is visibility.
In the Norwegian hosting market, where reliability is mandated not just by SLAs but often by the strict standards of Datatilsynet (The Data Inspectorate) regarding data availability, flying blind is not an option. You need to know two things: "Is it up?" and "How is it performing?"
That is where the holy trinity of 2010 infrastructure comes in: Nagios for the "Is it up?" alerts, Munin for the "Why is it slow?" graphs, and a rock-solid platform like CoolVDS to run it on.
The Philosophy: Alerting vs. Trending
I recently audited a setup for a client in Oslo running a high-traffic Magento store. They were experiencing intermittent 502 Bad Gateway errors on Nginx. Their solution? A cron job that restarted PHP-FPM every hour. It was barbaric. They had no idea that their RAM usage was creeping up by 50MB every ten minutes due to a memory leak in a custom extension.
We installed Munin. The graph showed a perfect "sawtooth" memory pattern. We identified the leak, patched the code, and stability returned. This is why you need both tools:
- Nagios is binary. It cares if a service is OK, WARNING, or CRITICAL. It screams at you when things break.
- Munin is analog. It paints the history. It tells you that your disk I/O wait has been increasing by 2% daily for the last week.
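The three Nagios states map onto plugin exit codes: 0 for OK, 1 for WARNING, 2 for CRITICAL. Here is a minimal sketch of that idea in shell. The function name check_threshold and the sample numbers are invented for illustration and are not part of the nagios-plugins package.

```shell
#!/bin/sh
# check_threshold VALUE WARN CRIT
# Prints a one-line status and returns the matching Nagios exit code.
# Hypothetical helper: integer values only, "higher is worse" semantics assumed.
check_threshold() {
    value=$1 warn=$2 crit=$3
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value"
        return 1
    fi
    echo "OK - value is $value"
    return 0
}

# Example: 85 is above the warning level (80) but below critical (90).
check_threshold 85 80 90 || true
```

Nagios only looks at the exit code and the first line of output. Everything about how the value evolved over time is lost, which is exactly the gap Munin fills.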
Prerequisites and Environment
For this guide, we are assuming a standard CentOS 5.5 environment (x86_64). While Debian Lenny is solid, RHEL/CentOS remains the standard for enterprise deployments in the Nordics. You will need root access. If you are on a CoolVDS instance, you have full root control and a clean kernel, which is critical for accurate stats.
Pro Tip: Virtualization Matters
Be careful running monitoring tools on budget OpenVZ containers. Because OpenVZ shares the host kernel, tools like `vmstat` or `free` often report the host node's resources, not your allocated limits. This leads to false positives. At CoolVDS, we use Xen HVM and KVM technology, providing complete hardware isolation. When Munin says you are out of swap, you are actually out of swap.
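If you are not sure what you are running on, a commonly used heuristic is that /proc/vz exists on any OpenVZ kernel while /proc/bc exists only on the hardware node. The sketch below assumes that heuristic holds for your provider. The function name and the optional ROOT argument are hypothetical and exist only so the check can be tried against a fake directory tree.

```shell
#!/bin/sh
# is_openvz_container [ROOT]
# Succeeds when ROOT/proc/vz exists and ROOT/proc/bc does not.
# ROOT defaults to the real filesystem root.
is_openvz_container() {
    root=${1:-}
    [ -d "$root/proc/vz" ] && [ ! -d "$root/proc/bc" ]
}

if is_openvz_container; then
    echo "OpenVZ container detected: treat memory and swap figures with suspicion"
else
    echo "No OpenVZ container detected"
fi
```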
Step 1: Installing the EPEL Repository
CentOS base repositories are conservative. To get modern versions of Nagios (3.x) and Munin (1.4.x), we need the Extra Packages for Enterprise Linux (EPEL) repository.
rpm -Uvh http://download.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
yum update -y
Step 2: Deploying Nagios Core
Install Nagios and the standard plugins. The plugins are the scripts that actually do the checking (ping, http, disk usage).
yum install nagios nagios-plugins-all nagios-plugins-nrpe
chkconfig nagios on
chkconfig httpd on
Nagios is configured via object definitions. We need to set up a contact to receive those precious alerts. Edit /etc/nagios/objects/contacts.cfg:
define contact{
    contact_name                    sysadmin
    use                             generic-contact
    alias                           Operations Team
    email                           alerts@yourdomain.no
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
}
Before you restart, always verify your configuration. A syntax error here prevents the daemon from starting:
nagios -v /etc/nagios/nagios.cfg
If you see Total Warnings: 0, Total Errors: 0, you are safe to launch.
service nagios start
service httpd start
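For later config changes, it helps to make the "verify, then restart" habit mechanical. The following is a rough sketch: nagios_config_is_clean is a hypothetical helper that reads the output of nagios -v and succeeds only when both summary counters are zero.

```shell
#!/bin/sh
# nagios_config_is_clean
# Reads `nagios -v` output on stdin. Succeeds only if the summary reports
# zero warnings and zero errors (and both summary lines are present).
nagios_config_is_clean() {
    awk '
        /^Total Warnings:/ { warnings = $3; seen_w = 1 }
        /^Total Errors:/   { errors = $3;   seen_e = 1 }
        END { exit !(seen_w && seen_e && warnings == 0 && errors == 0) }
    '
}

# Intended use on the monitoring host (not executed here):
#   nagios -v /etc/nagios/nagios.cfg | nagios_config_is_clean \
#       && service nagios restart
```

Treating warnings as blocking is a deliberate choice in this sketch. Relax the condition if you want to allow a restart when only warnings are reported.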
Step 3: Configuring Munin for Trend Analysis
Munin uses a master/node architecture. The "node" runs on the servers being monitored, and the "master" gathers the data to generate RRDtool graphs. On a single server, you install both.
yum install munin munin-node
chkconfig munin-node on
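On the master side, each monitored node is declared as a host stanza in /etc/munin/munin.conf. The packaged file normally ships with a localhost entry already, so the sketch below only matters once you add more machines. The helper name, hostname and IP are placeholders.

```shell
#!/bin/sh
# munin_host_stanza NAME ADDRESS
# Prints a host stanza suitable for appending to the master's munin.conf.
# Hypothetical helper for illustration.
munin_host_stanza() {
    printf '[%s]\n    address %s\n    use_node_name yes\n' "$1" "$2"
}

# Example with placeholder values; review the output before appending it:
munin_host_stanza web1.example.no 10.0.0.5
```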
The Node Configuration
By default, the node listens on port 4949. Security is paramount; you do not want competitors querying your load averages. In /etc/munin/munin-node.conf, ensure you only allow the master IP (localhost in this case):
# /etc/munin/munin-node.conf
log_level 4
log_file /var/log/munin/munin-node.log
pid_file /var/run/munin/munin-node.pid
background 1
setsid 1
user root
group root
# Regex to allow localhost
allow ^127\.0\.0\.1$
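If the master later moves to its own machine, the node needs one more allow line, with that machine's IP written as an anchored regular expression. Forgetting to escape the dots is an easy mistake, so here is a small sketch that builds the line for you. The function name and the IP address are placeholders.

```shell
#!/bin/sh
# munin_allow_line IP
# Prints an "allow" directive for munin-node.conf with the dots escaped
# and the pattern anchored at both ends. Hypothetical helper.
munin_allow_line() {
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf 'allow ^%s$\n' "$escaped"
}

# Example with a placeholder master IP:
munin_allow_line 10.0.0.5
```

Remember to restart munin-node after changing the allow list.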
Start the node agent:
service munin-node start
The beauty of Munin is the plugin ecosystem. It auto-detects what you have installed. If you install MySQL later, simply run:
munin-node-configure --shell | sh
This command scans your system, finds MySQL, Apache, or Postfix, and creates the necessary symlinks in /etc/munin/plugins/ automatically. Restart munin-node afterwards so the new plugins are loaded.
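Writing your own plugin is not much harder. A Munin plugin is just an executable that prints graph metadata when called with the argument config and prints current values otherwise. The sketch below counts files in a directory. The plugin name, the field name files and the spool_dir variable are invented for this example.

```shell
#!/bin/sh
# Minimal Munin plugin sketch: graphs the number of files in a directory.
# The directory comes from the environment variable spool_dir (an assumed
# name), falling back to /tmp.
spool_files_plugin() {
    dir=${spool_dir:-/tmp}
    case "${1:-}" in
        config)
            # Graph metadata, requested by the master when it sets up graphs
            echo "graph_title Files in $dir"
            echo "graph_vlabel files"
            echo "graph_category disk"
            echo "files.label files"
            ;;
        *)
            # Current value, fetched on every polling run
            count=$(ls -A "$dir" 2>/dev/null | wc -l | tr -d ' ')
            echo "files.value $count"
            ;;
    esac
}

spool_files_plugin "$@"
```

Saved as an executable file in /etc/munin/plugins/, a script like this would be picked up after the next munin-node restart.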
Step 4: Nginx & Apache Stub Status
To get the most out of web server monitoring, you need to expose internal metrics. For Nginx, the HttpStubStatusModule is essential for tracking active connections; it is not built by default, so check that your Nginx was compiled with --with-http_stub_status_module. Add this to your nginx.conf inside a server block restricted to localhost:
location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}
Once reloaded, Munin can graph Active Connections, Reading, and Writing states. This is the difference between guessing why the server is slow and knowing that your Keep-Alive timeout is too high.
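To see what Munin is working with, you can parse the status page yourself. The sketch below reads stub_status output on stdin and prints the counters as key=value pairs. The function name is hypothetical, and the sample numbers in the usage comment are not from a real server.

```shell
#!/bin/sh
# parse_stub_status
# Reads nginx stub_status output on stdin and prints selected counters.
# Expected input shape:
#   Active connections: 291
#   server accepts handled requests
#    16630948 16630948 31070465
#   Reading: 6 Writing: 179 Waiting: 106
parse_stub_status() {
    awk '
        /^Active connections:/ { print "active=" $3 }
        /^Reading:/ {
            print "reading=" $2
            print "writing=" $4
            print "waiting=" $6
        }
    '
}

# Intended use on the web server (not executed here), assuming curl is installed:
#   curl -s http://127.0.0.1/nginx_status | parse_stub_status
```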
The Hardware Reality: I/O Wait
One of the most common alerts you will see in Nagios is CPU Load. However, high load doesn't always mean the CPU is busy calculating. On Linux, the load average also counts processes blocked on disk I/O, so in a virtualized environment a high load often means I/O wait: the CPU sits idle while it waits for the disk to finish a read or write.
Standard VPS hosting often relies on SATA drives in RAID 10. While reliable, random write performance (IOPS) hits a ceiling quickly. If you see high "iowait" on your Munin graphs during backups or database imports, your storage is the bottleneck.
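If you want a quick number without waiting for the next Munin run, the iowait share can be derived from two samples of the aggregate cpu line in /proc/stat. The sketch below assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name is hypothetical.

```shell
#!/bin/sh
# iowait_percent SAMPLE1 SAMPLE2
# Each argument is one "cpu ..." line from /proc/stat, taken some seconds apart.
# Prints the share of CPU time spent in iowait between the two samples.
iowait_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            w[NR] = $6
        }
        END {
            d = t[2] - t[1]
            if (d > 0) {
                printf "%.1f\n", 100 * (w[2] - w[1]) / d
            } else {
                print "0.0"
            }
        }
    '
}

# Intended use on a Linux host (not executed here):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_percent "$a" "$b"
```

A sustained double-digit result during backups or imports points at storage rather than CPU.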
This is why CoolVDS invests heavily in 15k SAS drives and Enterprise SSD caching. While full SSD arrays are still prohibitively expensive for mass storage in 2010, our hybrid caching tier significantly reduces I/O latency. Low latency is crucial for users connecting via NIX (Norwegian Internet Exchange) in Oslo. You want the physical distance to be the only latency factor, not the disk arm seek time.
Advanced Integration: NSCA
If you are truly paranoid, you will want your servers to report back to a central monitoring server even when they sit behind a firewall. Use NSCA (Nagios Service Check Acceptor) to push passive check results.
This is particularly useful for backup scripts. Instead of Nagios checking if a backup is done, the backup script itself sends a success code to Nagios upon completion.
# Example bash snippet for a backup script.
# Place it directly after the backup command so that $? holds that command's exit status.
if [ $? -eq 0 ]; then
    printf "%s\t%s\t%s\t%s\n" "$HOSTNAME" "Backup" "0" "Success" | /usr/sbin/send_nsca -H monitor.coolvds.com -c /etc/nagios/send_nsca.cfg
else
    printf "%s\t%s\t%s\t%s\n" "$HOSTNAME" "Backup" "2" "Failed" | /usr/sbin/send_nsca -H monitor.coolvds.com -c /etc/nagios/send_nsca.cfg
fi
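If several scripts report this way, the pattern can be wrapped once and reused. The sketch below is one possible shape: both function names and the backup script path in the usage comment are hypothetical. The wrapped command's own output is sent to stderr so that only the result line reaches send_nsca.

```shell
#!/bin/sh
# nsca_result_line HOST SERVICE CODE MESSAGE
# Prints one tab-delimited passive check result in the layout send_nsca reads.
nsca_result_line() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# run_and_report SERVICE COMMAND [ARGS...]
# Runs COMMAND and prints an OK (0) or CRITICAL (2) result line for SERVICE.
run_and_report() {
    service_name=$1
    shift
    host=${HOSTNAME:-$(uname -n)}
    if "$@" >&2; then
        nsca_result_line "$host" "$service_name" 0 "Success"
    else
        nsca_result_line "$host" "$service_name" 2 "Failed"
    fi
}

# Intended use (not executed here), with a placeholder backup script path:
#   run_and_report Backup /usr/local/bin/backup.sh \
#       | /usr/sbin/send_nsca -H monitor.coolvds.com -c /etc/nagios/send_nsca.cfg
```

The service name must match a passive service defined for that host on the Nagios server, otherwise the result is discarded.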
Conclusion
Monitoring is not an afterthought; it is the foundation of a stable infrastructure. By combining the immediate alerting of Nagios with the historical trending of Munin, you gain total situational awareness.
However, monitoring a slow server only tells you it's slow. It doesn't fix the underlying hardware constraints. If your graphs are consistently showing high I/O wait or CPU steal time, it might be time to migrate to a platform that respects your need for dedicated resources.
Ready to stop fighting fires? Deploy a high-performance, KVM-based instance on CoolVDS today. Our infrastructure is tuned for the Nordic market, ensuring low latency and high availability for your critical services.