Sleep Through the Night: Bulletproof Server Monitoring with Nagios & Munin

Zero Downtime: Mastering Munin and Nagios for High-Traffic Infrastructure

It’s 3:42 AM. Your phone buzzes. It’s not a text from a friend; it’s your server screaming. Your MySQL process just died, your load average is 50.0, and you have no idea why. If you are running production workloads without granular monitoring, you aren't a SysAdmin; you're a gambler.

In the hosting world—especially here in Norway where reliability is expected—uptime is the only currency that matters. I’ve seen too many developers deploy code to a "black box" server, only to panic when traffic spikes. Today, we are going to fix that. We are building a monitoring stack that tells you before the crash happens using the industry standards: Nagios for alerting and Munin for resource graphing.

The Stack: Why Nagios and Munin?

You might ask, "Why two tools?" Because they solve two different problems.

  • Nagios is binary. Is the web server up? Yes/No. Is disk space under 90%? Yes/No. If "No," it pages you.
  • Munin is historical. It answers: "Why did the server crash at 4:00 AM?" by showing you a graph of memory usage creeping up since Tuesday.
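
The Yes/No model is visible in how Nagios plugins report back: the exit status is the verdict (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). Here is a minimal sketch of that convention. The function name check_percent and the 80/90 thresholds are made up for illustration; they are not part of any shipped plugin.

```shell
#!/bin/sh
# Sketch of the Nagios plugin convention: the exit status carries the verdict.
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# check_percent and its thresholds are hypothetical names for illustration.

check_percent() {
    used=$1   # current usage in percent (integer)
    warn=$2   # WARNING threshold
    crit=$3   # CRITICAL threshold

    # Reject empty or non-numeric input rather than guessing.
    case "$used" in
        ''|*[!0-9]*) echo "DISK UNKNOWN - bad input '$used'"; return 3 ;;
    esac

    if [ "$used" -ge "$crit" ]; then
        echo "DISK CRITICAL - ${used}% used"; return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "DISK WARNING - ${used}% used"; return 1
    fi
    echo "DISK OK - ${used}% used"; return 0
}

# Example: feed it the usage of / as reported by df.
# used=$(df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }')
# check_percent "$used" 80 90
```

In practice you would use the check_disk plugin from nagios-plugins-all rather than rolling your own; the sketch only shows what "binary" means on the wire.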

Step 1: The Setup (CentOS 5/6)

We assume you are running a clean CentOS VPS in Norway. I recommend CoolVDS for this because their upstream connectivity to NIX (Norwegian Internet Exchange) keeps network jitter from delaying your monitoring alerts. If you are on a budget host with high latency, you will get false positives.

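First, we need the EPEL repository, as standard CentOS repos don't carry these tools. The package below is the CentOS 5, i386 build; on CentOS 6 or x86_64, use the epel-release package that matches your release and architecture.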

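# Add the EPEL repository (this package is the CentOS 5, i386 build)
rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
# Bring the base system up to date
yum update -y
# Install Nagios, its plugins, Munin (master and node) and Apache
yum install nagios nagios-plugins-all munin munin-node httpd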

Step 2: Configuring Nagios for Real Alerts

The default config is useless for real alerting: the sample contact points at a local placeholder address. You need to define who gets yelled at. Open the contacts configuration:

vi /etc/nagios/objects/contacts.cfg

Look for the email directive. Do not just put your work email here. If your mail server is on the same box that crashes, you won't get the alert. Use an external provider or an SMS gateway email alias.

define contact{
        contact_name                    sysadmin
        use                             generic-contact
        alias                           Battle Hardened Admin
        email                           sysadmin@example.com   ; placeholder: use an address hosted off this server
        }

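Next, make sure the new contact belongs to a contact group that your hosts and services notify (the sample config uses admins), or nobody gets paged. Then verify your configuration and start the services. Don't forget chkconfig, or they won't survive a reboot.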

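# Validate the config first (some packages put the binary in /usr/sbin/nagios)
/usr/bin/nagios -v /etc/nagios/nagios.cfg
# Start Nagios and the Apache front end
service nagios start
service httpd start
# Start both at boot
chkconfig nagios on
chkconfig httpd on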
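
Once Nagios is live, the verify step matters even more: a typo in a config file plus a blind restart means no monitoring at all. One way to guard against that is a small wrapper that refuses to restart when the check fails. This is only a sketch; safe_restart is a hypothetical helper, and the commands are passed in as strings so you can adjust paths for your install.

```shell
#!/bin/sh
# Sketch: only restart Nagios when the config check passes.
# safe_restart is a hypothetical helper; the verify and restart commands are
# passed in as strings so they are easy to swap out or dry-run.

safe_restart() {
    verify_cmd=$1
    restart_cmd=$2

    # Run the check quietly; a non-zero exit means the config is broken.
    if ! sh -c "$verify_cmd" >/dev/null 2>&1; then
        echo "config check failed, service left untouched" >&2
        return 1
    fi
    sh -c "$restart_cmd"
}

# Usage with the paths from this article:
# safe_restart "/usr/bin/nagios -v /etc/nagios/nagios.cfg" "service nagios restart"
```

The running daemon keeps its old, working config until the new one validates, which is exactly what you want at 3 AM.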

Step 3: Munin for the "Why"

Nagios tells you the house is on fire; Munin shows you who lit the match. The magic of Munin is the plugins. It auto-detects what you are running (Apache, MySQL, Exim) and generates graphs.
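
A Munin plugin is just an executable with a two-part contract: called with config it describes the graph, called with no argument it prints the current values. The sketch below shows that contract for the 1-minute load average. Munin already ships a load plugin, so treat this as an illustration only; load1_plugin and the LOADAVG_FILE variable are names invented for this example.

```shell
#!/bin/sh
# Sketch of a Munin plugin: "config" describes the graph, no argument prints values.
# load1_plugin and LOADAVG_FILE are hypothetical names for this example.

LOADAVG_FILE=${LOADAVG_FILE:-/proc/loadavg}

load1_plugin() {
    case "${1:-}" in
        config)
            echo "graph_title 1-minute load average (sketch)"
            echo "graph_vlabel load"
            echo "graph_category system"
            echo "load1.label load"
            ;;
        *)
            # First field of /proc/loadavg is the 1-minute load average.
            echo "load1.value $(cut -d' ' -f1 "$LOADAVG_FILE")"
            ;;
    esac
}

# A real plugin file would end by dispatching on its own arguments:
# load1_plugin "$@"
```

The field name (load1 here) ties the label in the config output to the value in the fetch output, which is how Munin knows which line on the graph a number belongs to.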

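Edit /etc/munin/munin-node.conf to control which master may pull data. The allow directive takes a regular expression, so dots must be escaped. On a single box, the stock allow ^127\.0\.0\.1$ line is enough. If you have a cluster, add a second allow line with the master's IP. Then start the node and enable it at boot with service munin-node start and chkconfig munin-node on.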
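
Because the allow value is a regular expression, an unescaped dot matches any character, so hand-typing these lines is a common source of sloppy rules. A small sketch that builds the line from a plain IPv4 address follows. munin_allow_line is a hypothetical helper, and 192.0.2.10 is a documentation-range address standing in for your master.

```shell
#!/bin/sh
# Sketch: turn a dotted IPv4 address into an anchored "allow" line for
# munin-node.conf. munin_allow_line is a hypothetical helper.

munin_allow_line() {
    # Escape each dot so it matches a literal "." instead of any character.
    escaped=$(printf '%s\n' "$1" | sed 's/\./\\./g')
    printf 'allow ^%s$\n' "$escaped"
}

# Example with a documentation-range address standing in for the master:
# munin_allow_line 192.0.2.10
```

Paste the resulting line into munin-node.conf next to the existing localhost entry and restart the node.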

Pro Tip: Pay close attention to the "Disk I/O" and "MySQL Slow Queries" graphs. On shared hosting, high I/O wait usually means a noisy neighbor. This is why we migrated our core databases to CoolVDS; their Xen-based isolation and high-performance RAID arrays mean 0% steal time, even during peak hours.

The "Steal Time" Trap

One metric most admins ignore is CPU Steal Time. In your Munin CPU graph, if you see a large "purple" area (steal), it means the hypervisor is starving your VM of cycles. This is common with oversold budget providers.
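
If you want the number behind the graph, it comes from the aggregate cpu line in /proc/stat, where steal is the eighth counter after the cpu label. The counters are cumulative since boot, so you need two samples and the difference between them. A rough sketch, assuming a Linux kernel that exposes the steal column; steal_percent is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: steal time as a percentage of CPU time between two /proc/stat samples.
# Columns after the "cpu" label: user nice system idle iowait irq softirq steal.
# steal_percent is a hypothetical helper for illustration.

steal_percent() {
    # $1 and $2 are two "cpu ..." lines from /proc/stat, taken some seconds apart.
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Sum user..steal (fields 2-9); guest columns are skipped because
            # guest time is already counted inside user time.
            total[NR] = 0
            for (i = 2; i <= 9; i++) total[NR] += $i
            steal[NR] = $9
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (steal[2] - steal[1]) / dt
        }'
}

# Example: sample five seconds apart on a live system.
# a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
# steal_percent "$a" "$b"
```

A value that stays near zero is what you want; a sustained double-digit figure is the purple area in the Munin graph expressed as a number.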

If your monitoring shows high steal time, no amount of my.cnf tuning will save you. You need better hardware. CoolVDS guarantees dedicated CPU cycles, so your monitoring reflects your actual load, not the load of the teenager running a Minecraft server on the same node.

Compliance and Data Location

Working in the Nordic market, we have strict requirements regarding the Personal Data Act (Personopplysningsloven). While Safe Harbor exists for US transfers, the Norwegian Data Inspectorate (Datatilsynet) looks favorably on keeping data within national borders. By hosting your monitoring data—which often contains sensitive IP addresses and server names—on servers physically located in Oslo, you simplify your compliance posture significantly.

Final Thoughts

Monitoring is not optional. It is the difference between a minor hiccup and a business-ending outage. Set up Nagios today to catch the failures, and let Munin run for a week to establish your baseline.

If you are tired of debugging slow performance only to find out it's your host's slow disks, it's time to move. Deploy a test instance on CoolVDS; their low latency network and solid I/O performance make them the reference platform for serious SysAdmins in 2011.
