Surviving the Spike: A DevOps Guide to Infrastructure Monitoring in 2015

It is 3:00 AM. Your phone buzzes on the nightstand. It is not a text from a friend; it is PagerDuty screaming that the main database is unresponsive. You open your laptop, squinting at the screen, and see that the server is technically "up"—ping responds in 40ms. But the application is dead. SSH hangs. The load average is 400.

If you have been in this industry long enough, you know this scenario. It is the result of "green light" monitoring—dashboards that look healthy right up until the moment the cliff edge crumbles. As we approach 2015, the complexity of our stacks in Norway is outgrowing the old-school Nagios checks we grew up on.

I have spent the last decade debugging servers from Oslo to Bergen, and the lesson is always the same: Availability is not binary. Here is how we build monitoring architectures that actually warn you before the meltdown, using tools available right now in late 2014.

The Lie of "Free" Resources: Understanding Steal Time

Most budget VPS providers in Europe stack hundreds of clients onto a single physical node using OpenVZ. They promise you 4 cores, but when your neighbor starts compiling a kernel or rendering video, your performance tanks. The metric that exposes this is %st (steal time).

If you are running critical infrastructure, you need to be watching this value like a hawk. High steal time means the hypervisor is throttling you.

# Run 'top' and look at the %Cpu(s) line. Sample output from a healthy node:
top - 14:23:45 up 12 days,  4:12,  1 user,  load average: 0.15, 0.08, 0.06
Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.3 us,  1.0 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  0.0 si,  1.0 st

See that 1.0 st at the end? That is acceptable. If that number jumps to 10% or 20%, your CPU is waiting for the physical host to give it cycles. This is why for production workloads, we strictly recommend KVM virtualization, like the architecture used at CoolVDS. With KVM, resources are harder to overcommit, and kernel isolation prevents a neighbor's memory leak from crashing your party.
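
Not sure which technology your current provider actually runs behind the marketing copy? You can check from inside the guest. A quick sketch (virt-what is a small package available on most distributions; the /proc path below is specific to OpenVZ):

# Identify the virtualization technology from inside the guest
virt-what                      # typically prints "kvm", "xen", or "openvz"

# No package manager access? This file only exists inside OpenVZ containers
ls /proc/user_beancounters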

Beyond Ping: The ELK Stack Revolution

We are seeing a massive shift this year away from simply grepping logs on production servers (please stop doing this) toward centralized logging. The ELK Stack (Elasticsearch, Logstash, Kibana) has matured significantly with the recent Elasticsearch 1.4 release.

Instead of guessing why an error rate spiked, you can visualize it. But setting it up requires care. Java Heap size is the usual suspect for crashes here.
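
As a rough sketch (the paths assume the Debian/Ubuntu Elasticsearch 1.4 package; 2g is a placeholder, not a recommendation for your box), the heap is controlled by a single environment variable, and locking it into RAM avoids swap-induced stalls:

# /etc/default/elasticsearch  (Debian/Ubuntu package install)
# Give the JVM roughly half the machine's RAM; the rest belongs to the
# OS page cache, which Lucene relies on for fast searches.
ES_HEAP_SIZE=2g

# /etc/elasticsearch/elasticsearch.yml
# Prevent the heap from being swapped out (also raise the memlock ulimit)
bootstrap.mlockall: true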

Configuring Logstash for Nginx Access Logs

To get meaningful data, you need to parse your logs, not just store them. Here is a standard grok pattern I use for Nginx logs in /etc/logstash/conf.d/nginx.conf:

input {
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx_access"
  }
}

filter {
  if [type] == "nginx_access" {
    grok {
      match => { "message" => "%{IPORHOST:clientip} - %{USERNAME:remote_user} \[%{HTTPDATE:timestamp}\] \"%{WORD:verb} %{URIPATHPARAM:request} HTTP/%{NUMBER:httpversion}\" %{NUMBER:response} %{NUMBER:bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\"" }
    }
    geoip {
      source => "clientip"
    }
  }
}

output {
  elasticsearch { host => "localhost" }
}
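
Before restarting Logstash, sanity-check the file; a typo here silently kills the whole pipeline. A quick sketch, assuming the standard 1.4.x package layout under /opt/logstash:

# Parse the config and exit without starting the pipeline
/opt/logstash/bin/logstash agent -f /etc/logstash/conf.d/nginx.conf --configtest

# If it parses cleanly, restart the service
service logstash restart
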
Pro Tip: Be careful with the geoip filter if you have high traffic; it adds CPU overhead on every event. And since Elasticsearch is memory-hungry, ensure your monitoring server has dedicated RAM. CoolVDS offers instances with high-memory ratios specifically for Java-heavy stacks like Elasticsearch.

Graphing the Pulse: Graphite & Grafana

RRDtool (used by Munin and Cacti) is reliable, but it is ugly and slow to configure. The future—and what I am deploying for clients right now—is Graphite paired with the new frontend Grafana (which released version 1.9 recently). It allows us to render beautiful, real-time dashboards that clients actually understand.

The power of Graphite is that you can feed it metrics from anywhere using a simple shell script and `netcat`. You do not need complex agents.

Example: Pushing Custom MySQL Metrics

Want to track the number of active threads in MySQL every 10 seconds? You don't need a plugin. You need bash.

#!/bin/bash
# simplistic_monitor.sh

GRAPHITE_HOST="monitor.yourdomain.no"
GRAPHITE_PORT=2003
METRIC_PATH="servers.oslo-db-01.mysql.threads_running"

while true; do
  # Threads_running comes back as "Threads_running <N>"; -N suppresses the header row
  VALUE=$(mysql -u monitor -p'secret' -N -e "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{print $2}')
  TIMESTAMP=$(date +%s)

  # Graphite's plaintext protocol: "<metric.path> <value> <unix-timestamp>", one line per datapoint
  echo "$METRIC_PATH $VALUE $TIMESTAMP" | nc -q0 $GRAPHITE_HOST $GRAPHITE_PORT

  sleep 10
done

This level of granularity is essential when you are debugging intermittent lock contention that only happens during peak Norwegian shopping hours (18:00 - 21:00).
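
One caveat: Carbon only keeps data at the resolution its storage schema defines, so a 10-second push interval needs a matching retention rule or intermediate points are silently overwritten. A minimal sketch (the path varies by install; the pattern matches the servers.* prefix used in the script above):

# /etc/carbon/storage-schemas.conf
# 10-second points for 24 hours, then 1-minute points for 30 days
[servers]
pattern = ^servers\.
retentions = 10s:24h,1m:30d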

The Disk I/O Bottleneck

In 2014, the biggest bottleneck is rarely CPU; it is Disk I/O. If you are running a Magento store or a heavy Drupal site, your database is likely thrashing the disk. We monitor iowait religiously.

If you see %wa (iowait) consistently above 10-15%, your storage is too slow. This is where the hardware choice matters. Spinning rust (HDDs) cannot keep up with modern database random reads.
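
To see which device is actually saturated, drill down with iostat from the sysstat package (a quick sketch; the thresholds are the same rules of thumb as above):

# Extended per-device statistics, refreshed every 5 seconds
# Watch 'await' (ms per request) and '%util' (how busy the device is)
iostat -x 5

# The 'wa' column in vmstat is the same iowait figure at a glance
vmstat 5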

We recently migrated a client from a legacy host to CoolVDS's SSD-backed platform. Their iowait dropped from 25% to near zero, and page load times went from 3.2s to 0.8s. No code changes, just better physics.

Data Sovereignty and The Personal Data Act

We cannot talk about infrastructure in Norway without mentioning compliance. The Personal Data Act (Personopplysningsloven) and the guidelines from Datatilsynet are clear about how we handle personal data.

When you centralize logs (as mentioned with ELK above), you are moving user IP addresses and potentially PII (Personally Identifiable Information) across servers. If you host on clouds that silently replicate data to US servers, you are navigating a legal minefield regarding Safe Harbor.

Keeping your monitoring stack and your hosting on Norwegian soil—or at least strictly within the EEA with a provider like CoolVDS—simplifies your compliance posture significantly. You know exactly where the drives are.

Conclusion: Architect for Failure

Systems fail. Hard drives die. Network switches glitch. The goal of monitoring is not to prevent failure, but to reduce the Mean Time To Recovery (MTTR).

  1. Stop using OpenVZ for critical loads; switch to KVM.
  2. Implement centralized logging (ELK) so you aren't blind when a server becomes unreachable.
  3. Graph metrics (Graphite/Grafana) to spot trends before they become outages.

If you are tired of wondering why your server is slow, it might be time to stop fighting your provider's noisy neighbors. Deploy a KVM instance on CoolVDS today, install htop, and enjoy seeing 0.0 st.