
Sleep Through the Night: Bulletproof Server Monitoring with Munin and Nagios on CentOS 5

The Silence Before the Crash

It’s 3:17 AM. Your phone buzzes. It's not a text from a friend; it's an angry client asking why the webshop is returning a 502 Bad Gateway. You stumble to your laptop, SSH in, and find the server load is at 50.0. Apache is deadlocked. The database is crying.

If you had proper monitoring, you would have seen the warning signs three days ago.

In the world of systems administration, silence isn't golden; it's suspicious. Today, we are going to close that blind spot using the two most reliable tools in the open-source arsenal: Nagios for immediate alerts and Munin for historical trending. Whether you are running a single VPS or a cluster of dedicated servers, this setup is non-negotiable.

The Philosophy: State vs. Trend

Many junior admins confuse alerting with trending. Here is the distinction:

  • Nagios answers the question: "Is it broken right now?" It deals in discrete states (OK, WARNING, CRITICAL, UNKNOWN). It wakes you up.
  • Munin answers the question: "When did it start getting slow?" It draws graphs. It helps you diagnose the root cause.

You need both. Nagios tells you the disk is full; Munin shows you the graph of the disk filling up over the last week so you can catch it before it hits 100% next time.

Part 1: The Watchdog (Nagios 3 on CentOS 5)

Nagios 3 is the industry standard for a reason. It is ugly, complex to configure, and absolutely rock solid. While newer tools try to be flashy, Nagios just works.

First, install the necessary packages. I prefer using the EPEL repository for CentOS 5, as compiling from source is a waste of billable hours.

yum install nagios nagios-plugins-all nagios-plugins-nrpe

Configuring the Contacts

The biggest mistake I see is alerting the wrong people. Open /etc/nagios/objects/contacts.cfg. Do not send critical alerts to a generic 'info@' email that nobody checks until Monday morning.

define contact{
        contact_name                    sysadmin_on_call
        use                             generic-contact
        alias                           Battle Hardened Admin
        email                           [email protected]
        service_notification_period     24x7
        host_notification_period        24x7
        service_notification_options    w,u,c,r
        host_notification_options       d,u,r
        }

Pro Tip: Use the check_nrpe plugin to execute checks locally on remote servers. Checking a port from the outside tells you the firewall is open. Checking the process table locally tells you if MySQL is actually running or just a zombie process.
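
To make the local check idea concrete, here is a minimal sketch of the kind of script NRPE could run on the monitored host. The function names and the mysqld process name are illustrative assumptions, and the stock check_procs plugin from nagios-plugins already covers this case. The sketch only shows the exit-code contract Nagios relies on (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and how zombies can be filtered out of the count.

```shell
#!/bin/sh
# Illustrative local process check, intended to be run on the monitored
# host through NRPE. Function names and "mysqld" are assumptions.
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

# Count live (non-zombie) processes whose command name is exactly $1.
# Zombies show a process state starting with "Z" and are skipped.
count_live_procs() {
    ps -e -o stat= -o comm= |
        awk -v name="$1" '$1 !~ /^Z/ && $2 == name { n++ } END { print n + 0 }'
}

# Print a Nagios status line for process $1 given a live count $2,
# and return the matching plugin exit code.
report_proc_state() {
    if [ "$2" -ge 1 ]; then
        echo "PROCS OK - $2 live $1 process(es)"
        return 0
    fi
    echo "PROCS CRITICAL - no live $1 process found"
    return 2
}
```

Saved as a plugin script, the file would end with report_proc_state mysqld "$(count_live_procs mysqld)" followed by exit $?, so that NRPE hands the status line and the exit code back to Nagios.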

Part 2: The Historian (Munin)

Munin uses RRDTool (Round Robin Database) to store data. It is fantastic for spotting memory leaks or slow I/O degradation. Installing the node on your target server is straightforward:

yum install munin-node
chkconfig munin-node on
service munin-node start

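By default munin-node only accepts connections from localhost, so edit /etc/munin/munin-node.conf on the node and add an allow line that matches your master's IP as a regular expression (for example allow ^192\.168\.1\.10$), then restart munin-node. Then, on your master server, add the node to /etc/munin/munin.conf: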

[db.coolvds.no]
    address 10.0.0.5
    use_node_name yes
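
Every graph the master draws comes from a plugin on the node: a small executable that prints graph metadata when called with the argument config and prints current values otherwise. Munin already ships a load plugin, so the following is only a sketch of that protocol, not a replacement. The function name, the field name load, and the LOADAVG_FILE override are illustrative assumptions.

```shell
#!/bin/sh
# Sketch of a Munin plugin. The field name "load" and the LOADAVG_FILE
# override are illustrative assumptions (the override exists only to
# make the function easy to try out against a sample file).

munin_load_plugin() {
    if [ "$1" = "config" ]; then
        # Graph metadata, printed when Munin calls the plugin with "config".
        echo "graph_title Load average (1 min)"
        echo "graph_vlabel load"
        echo "graph_category system"
        echo "load.label load"
        return 0
    fi
    # Normal run: print the current value for each field.
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r one_min _rest < "${LOADAVG_FILE:-/proc/loadavg}"
    echo "load.value $one_min"
}
```

As a real plugin file, the script would end with munin_load_plugin "$1" and be linked into /etc/munin/plugins on the node.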

The Hidden I/O Killer

Here is where cheap hosting providers fail you. Munin generates a lot of small write operations. Every 5 minutes, it updates hundreds of .rrd files. On a standard VPS with oversold storage, this creates "iowait". Your monitoring tool ends up slowing down the very server it is supposed to watch.

We see this constantly with clients migrating to us. They try to run Munin on a budget VPS, and gaps start appearing in the graphs because disk latency is too high for the updates to keep up.
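
To put a number on the five-minute update cycle, here is a rough back-of-the-envelope sketch. It assumes the .rrd files live under /var/lib/munin (a common location; adjust for your install), and the function names are made up for illustration.

```shell
#!/bin/sh
# Rough estimate of Munin's write load. The directory you pass in is an
# assumption: /var/lib/munin is a common location for the .rrd files.

# Count .rrd files below directory $1.
count_rrd_files() {
    find "$1" -type f -name '*.rrd' | wc -l | tr -d ' '
}

# Number of RRD file updates per day for $1 files at a 5-minute interval.
# 24 h * 60 min / 5 min = 288 update cycles per day.
rrd_updates_per_day() {
    echo $(( $1 * 288 ))
}
```

For example, rrd_updates_per_day "$(count_rrd_files /var/lib/munin)" on a master tracking 300 files gives 86400 file updates per day. That averages out to one per second, but the writes arrive in a burst every five minutes, which is exactly the pattern that hurts on slow, shared disks.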

The CoolVDS Advantage: Hardware Matters

Software configuration can only save you so much. If the underlying spindles are slow, your database locks up. At CoolVDS, we don't play the "overselling" game common in the budget market.

Our infrastructure uses enterprise-grade 15k RPM SAS RAID-10 arrays. Unlike standard SATA drives used by budget hosts, 15k SAS drives offer vastly superior random I/O performance. This means you can run intensive RRDTool updates for Munin alongside your high-traffic MySQL database without the disk queue spiking.

Norwegian Reliability

For our clients in Oslo and the greater Nordic region, latency is king. Hosting your monitoring server outside the country introduces network jitter that leads to false positives in Nagios. By placing your infrastructure in our Oslo datacenter, connected directly to NIX (Norwegian Internet Exchange), you ensure that an alert is a real problem, not just a hiccup in a trans-Atlantic fiber cable.

Furthermore, keeping your data within Norway ensures compliance with the Personal Data Act (Personopplysningsloven). Even server logs contain IP addresses, which Datatilsynet considers personal data. Don't risk it by hosting on a budget box in Texas.

Final Configuration Checks

Before you close your SSH session, verify your firewall allows the monitoring server to talk to the nodes. In iptables, you need to allow port 5666 (NRPE) and 4949 (Munin) only from your monitoring IP.

-A RH-Firewall-1-INPUT -s 192.168.1.10 -p tcp -m state --state NEW -m tcp --dport 5666 -j ACCEPT
-A RH-Firewall-1-INPUT -s 192.168.1.10 -p tcp -m state --state NEW -m tcp --dport 4949 -j ACCEPT
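
If you roll this out to more than a couple of nodes, typing the rules by hand invites typos. A small helper along these lines can print them for whatever monitoring IP you use. The function name is hypothetical, and it assumes the stock CentOS 5 chain name RH-Firewall-1-INPUT used in the rules above.

```shell
#!/bin/sh
# Print the two ACCEPT rules for the monitoring server address given in $1.
# Assumes the stock CentOS 5 chain name RH-Firewall-1-INPUT.
monitoring_rules() {
    for port in 5666 4949; do   # 5666 = NRPE, 4949 = munin-node
        echo "-A RH-Firewall-1-INPUT -s $1 -p tcp -m state --state NEW -m tcp --dport $port -j ACCEPT"
    done
}
```

Calling monitoring_rules 192.168.1.10 prints the two lines shown above. Paste them into /etc/sysconfig/iptables ahead of the chain's final REJECT rule, then run service iptables restart.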

Monitoring is the difference between a professional and an amateur. It gives you the confidence to deploy on a Friday (though we still don't recommend that).

Need a rock-solid foundation for your monitoring stack? Deploy a CoolVDS Xen instance today. With our 15k SAS storage and gigabit uplink to NIX, you’ll never miss a heartbeat.
