Visualizing Infrastructure: Moving Beyond Nagios to Grafana 2.0

If I have to look at another ugly static graph generated by Munin or Nagios, I might just `rm -rf /` my own workstation. It is 2015. We are deploying distributed systems, yet many sysadmins are still monitoring them like it's 1999.

The release of Grafana 2.0 in April changed the landscape. It finally introduced a backend (written in Go), meaning we get user authentication and server-side rendering. For those of us managing servers across Europe, particularly latency-sensitive nodes in Oslo, relying on `top` and `tail -f` isn't scalable. You need historical context to diagnose why your MySQL replication lag spiked at 03:00.

Here is how to build a monitoring stack that actually works, using Collectd, InfluxDB, and Grafana.

The Storage Bottleneck: Why InfluxDB Needs IOPS

Before we touch config files, let's talk hardware. A Time Series Database (TSDB) like InfluxDB is a write-heavy beast. It ingests thousands of metrics per second. If you try to run this on a standard VPS with spinning rust (HDD) or network-attached storage with poor throughput, your monitoring stack will crash before your production app does.

I recently audited a setup where the metrics lag was 15 minutes. The bottleneck? Disk I/O wait. The CPU was sitting idle while the drive screamed for mercy.

Pro Tip: Never put your TSDB on shared hosting or standard HDDs. You need high random write performance. This is why we provision CoolVDS instances with pure SSD storage and KVM virtualization. The I/O isolation ensures your metrics ingest doesn't get blocked by a noisy neighbor.
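Don't take my word for it; benchmark the disk before you commit. Here's a rough fio sketch (the test path is a placeholder, point it at whatever volume will actually hold your InfluxDB data):

sudo yum install fio
# 60 seconds of 4k random writes, bypassing the page cache
fio --name=tsdb-write-test --filename=/data/fio-test --size=1G \
    --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 \
    --runtime=60 --time_based --group_reporting
rm /data/fio-test

If that comes back in the low hundreds of IOPS, you're looking at spinning rust territory and the TSDB will fall behind the moment you add a second host.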

The Stack: CentOS 7, InfluxDB, Grafana

We are using CentOS 7. It’s stable, it uses systemd (love it or hate it, it's here), and it's the standard for enterprise deployments.

1. Install InfluxDB (0.9.0)

InfluxDB 0.9 is currently in release candidate/beta, but the API changes from 0.8 are significant enough that you should start here. Don't build legacy debt.

wget https://s3.amazonaws.com/influxdb/influxdb-0.9.0-1.x86_64.rpm
sudo yum localinstall influxdb-0.9.0-1.x86_64.rpm
sudo systemctl start influxdb
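With the daemon up, enable it on boot and create a database for the incoming metrics. A quick sketch against the 0.9 HTTP API; the database name "collectd" is just the convention I'm assuming for the rest of this post:

sudo systemctl enable influxdb
# Should return 204 No Content if the daemon is alive
curl -G 'http://localhost:8086/ping'
curl -G 'http://localhost:8086/query' --data-urlencode "q=CREATE DATABASE collectd"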

2. Configure Collectd

Forget the InfluxDB agent for now; it's too new. Use the battle-tested collectd to gather system stats. It’s lightweight and written in C.
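On CentOS 7, collectd lives in EPEL, so pull that in first (assuming you aren't building it from source):

sudo yum install epel-release
sudo yum install collectd
sudo systemctl enable collectd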

Edit /etc/collectd.conf to enable the network plugin. This pushes data to your InfluxDB instance (which supports the Collectd protocol natively).

LoadPlugin network
<Plugin network>
    # Point at the host running the InfluxDB collectd listener (default UDP port 25826)
    Server "127.0.0.1" "25826"
</Plugin>

LoadPlugin cpu
LoadPlugin memory
LoadPlugin disk
LoadPlugin interface
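The other half of the handshake lives on the InfluxDB side: its collectd listener has to be enabled and pointed at the database you created earlier. A sketch of the relevant block in /etc/influxdb/influxdb.conf; the key names may vary slightly between 0.9 builds, and the types.db path assumes the stock CentOS collectd package:

[collectd]
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  typesdb = "/usr/share/collectd/types.db"

Restart both daemons and the measurements should start flowing:

sudo systemctl restart influxdb collectd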

3. Deploying Grafana 2.0

This is where the magic happens. Install the RPM:

sudo yum install https://grafanarel.s3.amazonaws.com/builds/grafana-2.0.2-1.x86_64.rpm
sudo systemctl start grafana-server

Navigate to port 3000. The default login is admin/admin. Change it immediately.
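One gotcha: a stock CentOS 7 box ships with firewalld on, which will silently eat connections to port 3000. Open it and make sure Grafana survives a reboot:

sudo systemctl enable grafana-server
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --reload

Then add your InfluxDB instance (http://localhost:8086, database collectd) as a data source from the side menu and start building dashboards.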

The Metric That Matters: CPU Steal

When you configure your dashboard, the first graph you build should not be "Total CPU Usage." It should be CPU Steal Time.

In a virtualized environment (VPS), `steal` (%st in `top`) measures the time your virtual CPU was ready to run instructions but had to wait for the physical hypervisor to give it time. If you see this number creep above 1-2%, your hosting provider is overselling their physical cores. You are fighting for scraps.
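You don't need a dashboard to get a first reading. On the box itself, steal is the eighth counter on the aggregate cpu line in /proc/stat, and top shows it as the st column:

# Steal time (in jiffies) accumulated since boot
awk '/^cpu /{print "steal jiffies:", $9}' /proc/stat
# One-shot snapshot of the %st column
top -bn1 | grep 'Cpu(s)'

Graph the same metric over a week in Grafana and you'll know exactly when your neighbors wake up.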

This is a plague in the cheap VPS market. You think you bought four cores, but you really bought four tickets to a lottery you rarely win.

At CoolVDS, we monitor the hypervisors strictly. If you pay for cores, you get the cycles. Our distinct lack of CPU steal is why developers migrate to us for latency-sensitive applications like VoIP or real-time bidding.

Data Sovereignty and Latency

For those operating out of Oslo, location matters. If your servers are in Frankfurt but your users are in Norway, you are adding 20-30ms of round-trip time unnecessarily.

Furthermore, with the Personopplysningsloven (Personal Data Act) being strictly enforced by Datatilsynet, keeping your logs and metrics—which often inadvertently contain IP addresses or user IDs—on Norwegian soil is a smart compliance move. Don't risk exporting data to US-based clouds if you don't have to.

Conclusion

Grafana 2.0 is not just eye candy; it is an essential diagnostic tool. But a dashboard is only as good as the data feeding it, and the infrastructure underneath it.

Don't let slow I/O kill your monitoring stack. Don't let CPU steal kill your application performance. If you want to test this stack, spin up a CoolVDS SSD instance. You can be fully configured in under 10 minutes.