Surviving the Spike: A Battle-Hardened Guide to Application Performance Monitoring in 2016

It is 3:00 AM on a Tuesday. Your Zabbix dashboard is lighting up like a Christmas tree, and your lead developer is swearing that the code is fine. "It works on my local machine," they say. Meanwhile, your response times from Oslo to Trondheim have spiked from 20ms to 2 seconds, and customers are bouncing.

Welcome to the reality of systems administration. If you cannot see inside the black box, you are just guessing. And in 2016, with traffic loads increasing exponentially, guessing gets you fired.

Most "monitoring" solutions sold today are just pretty dashboards that tell you the server is on fire after it has already burned down. Real Application Performance Monitoring (APM) isn't about staring at a green light; it is about forensic visibility into your stack. Today, we are going to look at how to monitor the things that actually matter: disk latency, CPU steal time, and application throughput.

The Silent Killer: Disk I/O Wait

I have seen more startups fail because of cheap storage than bad code. You can have a 16-core CPU, but if your disk queue is backed up, those cores are sitting idle waiting for data. This is iowait.

To diagnose this, standard top output isn't enough. You need iostat, part of the sysstat package on both CentOS 7 and Ubuntu 16.04.

# Install sysstat (Debian/Ubuntu; on CentOS 7 use: yum install sysstat)
apt-get install sysstat

# Watch extended statistics every 1 second
iostat -x 1

Pay close attention to the %util and await columns.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           14.50    0.00    3.20   45.10    0.00   37.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     4.00    0.00  156.00     0.00  4512.00    57.85     2.10   14.20    0.00   14.20   6.35  99.10

If your %util is near 100% and your await is climbing, your spinning rust (HDD) or cheap SATA SSD is the bottleneck. In the example above, 45.10% iowait means the CPU is wasting nearly half its time just waiting for the disk.
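You don't need a full monitoring agent to catch this condition; a few lines of awk will do for a cron job or a Zabbix UserParameter. A minimal sketch, with the sample iostat output above standing in for the live data (in production you would pipe in the second report of iostat -x 1 2, since the first is a since-boot average; the 90% threshold is an arbitrary starting point):

```shell
# Hypothetical saturation check: flag any device whose %util exceeds 90%.
# The canned sample mirrors the iostat output shown above.
sample='Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     4.00    0.00  156.00     0.00  4512.00    57.85     2.10   14.20    0.00   14.20   6.35  99.10'

# Column 10 is await (ms); the last column is %util
echo "$sample" | awk 'NR > 1 && $NF > 90 {
    print "ALERT: " $1 " at " $NF "% util, await " $10 " ms"
}'
```

Wire something like this into your alerting and you hear about disk saturation before your database does.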

Pro Tip: This is why hardware selection is not a commodity. At CoolVDS, we have standardized on NVMe storage for our high-performance tiers. The IOPS difference between SATA and NVMe isn't just a number; it is the difference between your database locking up during a backup and staying responsive. If your provider is still selling you "Standard SSD" for database workloads in 2016, you are paying for latency.

Nginx & PHP-FPM: Exposing the Metrics

Stop grepping logs to see how many connections you have. Both Nginx and PHP-FPM have built-in status pages that are extremely lightweight. They are disabled by default for security, so let's turn them on.

1. Nginx Stub Status

Add this to your nginx.conf inside a server block restricted to localhost or your monitoring IP:

location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

Testing it with curl:

$ curl http://127.0.0.1/nginx_status
Active connections: 291 
server accepts handled requests
 16630948 16630948 31070465 
Reading: 6 Writing: 179 Waiting: 106
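Those numbers are trivial to scrape into a monitoring agent. A sketch of a collector (the canned sample stands in for the live curl call, and the field positions assume the exact stub_status layout shown above):

```shell
# Turn stub_status output into key=value pairs for a monitoring agent.
# Live usage: curl -s http://127.0.0.1/nginx_status | parse_stub_status
parse_stub_status() {
    awk '
        /^Active connections:/ { print "active=" $3 }
        /^Reading:/            { print "reading=" $2 " writing=" $4 " waiting=" $6 }
    '
}

# Illustration using the sample output from the curl call above
sample='Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106'

echo "$sample" | parse_stub_status
```

A steadily climbing "waiting" count is normal with keepalive; a climbing "writing" count means requests are stuck generating responses, which usually points at your backend.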

2. PHP-FPM Status

Edit your pool configuration (usually /etc/php/7.0/fpm/pool.d/www.conf on Ubuntu 16.04) and reload the service afterwards:

pm.status_path = /status

Now you can query exactly what your PHP workers are doing. This is critical for sizing your pm.max_children setting: if "active processes" keeps hitting your limit, new requests pile up in the listen queue and eventually get dropped.
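The status path only works if your web server actually routes it to FPM. A hedged nginx snippet, locked down the same way as the stub_status block (the socket path is the Ubuntu 16.04 default for php7.0-fpm; adjust for your distro):

```nginx
# Route /status to PHP-FPM, reachable from monitoring hosts only
location = /status {
    allow 127.0.0.1;
    deny all;
    include fastcgi_params;
    fastcgi_param SCRIPT_NAME /status;
    # Socket path assumes Ubuntu 16.04's php7.0-fpm default
    fastcgi_pass unix:/run/php/php7.0-fpm.sock;
}
```

Appending ?full to the request lists every worker individually, which is handy for spotting a single request that has been running for 300 seconds.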

Centralized Logging: The ELK Stack (2016 Edition)

SSHing into five different servers to check logs is amateur hour. With the recent release of Elasticsearch 5.0 (October 2016), the ELK stack (Elasticsearch, Logstash, Kibana) has matured significantly, though for stability many of us are still running the 2.4.x branch.

The goal is to ship Nginx access logs to a central place so we can visualize 500 errors and slow requests. Here is a battle-tested Logstash configuration input for Nginx:

input {
  file {
    path => "/var/log/nginx/access.log"
    type => "nginx_access"
    start_position => "beginning"
  }
}

filter {
  if [type] == "nginx_access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    geoip {
      source => "clientip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-%{+YYYY.MM.dd}"
  }
}

With this setup, you can build a Kibana dashboard showing exactly which endpoints are throwing 500 errors. Combined with GeoIP, you can see if that DDoS attack is coming from a specific region and block it at the firewall level.

The "Noisy Neighbor" Problem & CPU Steal

Here is the dirty secret of the VPS industry: overcommitment. Providers stack 500 tenants on a single host, banking on the fact that not everyone will use their CPU at once. But when they do, you suffer.

Run top and look at the %st (steal) value.

%Cpu(s):  12.1 us,  4.2 sy,  0.0 ni, 75.0 id,  0.2 wa,  0.0 hi,  0.1 si,  8.4 st

If %st is consistently above 0%, your hypervisor is throttling you. You are paying for a CPU cycle that the host is giving to someone else. This introduces jitter into your application that no amount of code optimization will fix.
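You can watch for this programmatically too. A minimal steal watchdog, assuming the standard /proc/stat layout (field 9 of the "cpu" line is steal time in jiffies); the 5% threshold is an arbitrary starting point:

```shell
#!/bin/sh
# Sample aggregate CPU counters twice, compute steal % over the interval.
# /proc/stat "cpu" line fields: user nice system idle iowait irq softirq steal ...
THRESHOLD=5  # percent; tune to taste

cpu_sample() { awk '/^cpu /{ print $9, $2+$3+$4+$5+$6+$7+$8+$9 }' /proc/stat; }

set -- $(cpu_sample); st1=$1 tot1=$2
sleep 1
set -- $(cpu_sample); st2=$1 tot2=$2

steal=$(awk -v s1="$st1" -v s2="$st2" -v t1="$tot1" -v t2="$tot2" 'BEGIN {
    dt = t2 - t1
    printf "%.1f", (dt > 0 ? (s2 - s1) * 100 / dt : 0)
}')

if awk -v s="$steal" -v t="$THRESHOLD" 'BEGIN { exit !(s > t) }'; then
    echo "WARN: steal at ${steal}% - time to ask your provider hard questions"
else
    echo "OK: steal at ${steal}%"
fi
```

Run it from cron and log the result; a history of steal spikes correlated with your latency graphs is exactly the evidence you need when opening a ticket.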

This is why at CoolVDS, we use KVM (Kernel-based Virtual Machine) with strict resource isolation. We don't play the overcommitment game: when you buy 4 vCPUs, you get the cycles of 4 vCPUs. In a market where latency to NIX (the Norwegian Internet Exchange) needs to stay under 5ms, CPU steal is unacceptable.

Compliance and Data Sovereignty

With the Datatilsynet (Norwegian Data Protection Authority) tightening regulations and the new EU General Data Protection Regulation (GDPR) looming on the horizon for 2018, where you monitor your data is as important as how you monitor it. Sending your log files (which contain IP addresses and user agents) to a US-based SaaS monitoring platform puts your compliance on shaky ground, especially now that the old EU-US Safe Harbor framework has been struck down.

Self-hosting your monitoring stack (like ELK or Zabbix) on a Norwegian VPS ensures that your customer data never leaves Norwegian legal jurisdiction. It is not just about performance; it is about risk management.

The Final Word

Performance monitoring in 2016 is about peeling back the layers of abstraction. It requires looking at the kernel, the disk subsystem, and the network topology. Don't rely on "cloud magic." Verify your resources.

If you are tired of debugging latency issues caused by your hosting provider, it is time to switch to infrastructure that respects your engineering efforts. Don't let slow I/O kill your SEO.

Ready for consistent performance? Deploy a high-performance KVM instance on CoolVDS in 55 seconds and see the difference a dedicated resource pool makes.