Stop Guessing: The Sysadmin’s Guide to Real Application Performance Monitoring in 2013

It is 3:00 AM. The pager is screaming. The client, a large e-commerce retailer based in Oslo, is calling to say their checkout is hanging. You SSH in. The load average is 15.0. There is plenty of free memory. What do you do?

Most "admins" will just restart Apache and pray. That is not engineering; that is surrender. If you want to survive in this industry, you need to stop guessing and start looking at what the kernel is actually telling you. In the world of high-traffic hosting, specifically here in Norway where latency to the NIX (Norwegian Internet Exchange) is scrutinized down to the millisecond, ignorance is expensive.

We are going to look at how to diagnose bottlenecks properly using tools that are already on your server right now. No expensive SaaS agents required—just raw Linux competence.

1. The Lie of "Load Average"

We have all seen it. You type uptime and see a load of 10.0. You panic. But load average is just the number of processes in the run queue or waiting for disk I/O. A load of 10 on a 12-core box is fine. A load of 2 on a single-core VPS is a disaster.
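To make that comparison concrete, here is a minimal sketch that divides the load average by the core count. The helper name load_per_core is made up for this example; on a live box you would feed it the first field of /proc/loadavg and the output of nproc.

```shell
# load_per_core LOAD CORES: print the load average divided by the core count.
# (load_per_core is a hypothetical helper name, not a standard tool.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        # Refuse to divide by zero or a negative core count.
        if (cores <= 0) { print "n/a"; exit }
        printf "%.2f\n", load / cores
    }'
}
```

Usage: load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)". As a rough rule of thumb, a value that stays well above 1.00 means more tasks are queued than you have cores to serve them.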

The real question is: Why are they waiting?

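To find out, we need to look at CPU states. vmstat ships with the procps package, so it is almost certainly installed already. Install sysstat as well if you haven't (apt-get install sysstat on Ubuntu 12.04 LTS); it adds companions such as iostat and mpstat.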

vmstat 1

Watch the columns under cpu specifically:

  • us (user): Your code is running (PHP, Python, etc). High is usually good—it means work is getting done.
  • sy (system): The kernel is working. If this is above 20%, you have a problem: typically heavy context switching, driver issues, or bad iptables rules.
  • wa (wait): The killer. The CPU is idle, waiting for the disk to read/write data.

Pro Tip: If your wa is consistently over 10-15%, your storage is the bottleneck. This is common on budget VPS providers who oversell spinning HDDs. This is why at CoolVDS we strictly provision high-performance SSD storage in RAID10 for our KVM instances. We don't believe in making your CPU wait for a spinning platter.
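If you would rather not eyeball the wa column, a small filter can average it for you. This is a sketch only: avg_iowait is a hypothetical helper name, and it assumes the usual vmstat layout where the second header line starts with r and names each column.

```shell
# avg_iowait: read vmstat output on stdin, print the mean of the wa column.
# (avg_iowait is a hypothetical helper name, not part of procps.)
avg_iowait() {
    awk '
        # The column-name header starts with "r"; find where "wa" sits.
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
        # Data rows start with a number; ignore anything before the header.
        col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
        END { if (n) printf "%.1f\n", sum / n; else print "n/a" }
    '
}
```

Usage: vmstat 1 10 | avg_iowait. Keep in mind that the first row vmstat prints is an average since boot, so a longer sample gives a fairer picture of what is happening right now.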

2. The "Steal Time" Trap (%st)

This is the metric most people ignore until it ruins their week. Run top and look at the CPU line. See %st?

Steal Time is the percentage of time your virtual CPU was ready to run, but the hypervisor (the physical server) didn't give it CPU cycles. This happens when your hosting provider crams too many customers onto one physical node. It is the "noisy neighbor" effect.

If you see %st climbing above 5-10%, there is no config tweak in my.cnf that will save you. You are fighting for oxygen. The only fix is to move to a provider that guarantees resources. We architect CoolVDS on KVM (Kernel-based Virtual Machine) with strict resource isolation to ensure that when you buy a core, you get a core.
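top reads its numbers from /proc/stat, and you can do the same when you want steal time in a script or a cron job. The sketch below assumes the standard layout of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal, in that order); steal_pct is a hypothetical helper name.

```shell
# steal_pct: read two or more "cpu ..." lines from /proc/stat on stdin and
# print the steal percentage between each consecutive pair of samples.
# (steal_pct is a hypothetical helper name.)
steal_pct() {
    awk '
        $1 == "cpu" {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            # Skip the first sample, and any pair where no time has passed.
            if (seen++ && total > prev_total)
                printf "%.1f\n", 100 * ($9 - prev_steal) / (total - prev_total)
            prev_steal = $9
            prev_total = total
        }
    '
}
```

Usage: { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | steal_pct. Sampling over a few seconds smooths out the spikes you would see in a single top refresh.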

3. Application Metrics: Exposing the Internals

System metrics tell you that you are slow. Application metrics tell you why. Standard installs of Nginx and PHP-FPM hide this data by default. Let's fix that.

Nginx Stub Status

Inside your nginx.conf or vhost definition, add a location block to expose the status page. Restrict it to your IP for security.

server {
    listen 80;
    server_name localhost;

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

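Test it with curl http://127.0.0.1/nginx_status. You will see active connections plus the Reading, Writing, and Waiting counters. If Reading is high, you might have slow clients (mobile networks) that are still sending their requests. A large Waiting count is mostly idle KeepAlive connections, which points to KeepAlive settings worth reviewing.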
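To graph those counters or alert on them, you need them in a machine-friendly form. Here is a minimal sketch that assumes the standard stub_status text output, where the last line reads "Reading: N Writing: N Waiting: N"; nginx_states is a hypothetical helper name.

```shell
# nginx_states: read the stub_status page on stdin and print the three
# connection-state counters as key=value pairs.
# (nginx_states is a hypothetical helper name.)
nginx_states() {
    awk '$1 == "Reading:" { print "reading=" $2, "writing=" $4, "waiting=" $6 }'
}
```

Usage: curl -s http://127.0.0.1/nginx_status | nginx_states. Log the output once a minute and you have a poor man's connection graph.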

PHP-FPM Status

PHP is usually the culprit. Edit your pool config (usually in /etc/php5/fpm/pool.d/www.conf):

pm.status_path = /status

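Now you can monitor the active processes, provided Nginx has a location that passes /status through to the FPM pool. If "active processes" reaches your pm.max_children limit, new requests queue up, and once the backlog overflows Nginx answers with 502 Bad Gateway. You either need to optimize your code or add RAM so you can raise pm.max_children.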
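A quick way to see how close a pool is to its ceiling is to compare the "active processes" line of the status page with pm.max_children. This sketch assumes the plain-text status output and that you pass your own pm.max_children value as the argument; fpm_saturation is a hypothetical helper name.

```shell
# fpm_saturation MAX_CHILDREN: read the PHP-FPM status page on stdin and
# print active processes as a percentage of MAX_CHILDREN.
# (fpm_saturation is a hypothetical helper name.)
fpm_saturation() {
    awk -F': *' -v max="$1" '
        # Exact match, so "max active processes" is not picked up by mistake.
        max > 0 && $1 == "active processes" { printf "%d%%\n", 100 * $2 / max }
    '
}
```

Usage, assuming the status page is reachable on localhost and pm.max_children is 50: curl -s http://127.0.0.1/status | fpm_saturation 50. If the number sits near 100% during normal traffic, you are one spike away from errors.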

Metric             | Command/Tool    | Danger Zone
Disk I/O Wait      | vmstat 1, iotop | > 20% wa
CPU Steal          | top             | > 5% st
MySQL Slow Queries | mysqldumpslow   | Any consistent locking

4. When in Doubt: Strace It

Sometimes the logs are empty. The process is running but stuck. Enter strace. It intercepts system calls. It is verbose, it is messy, and it is beautiful.

Find the PID of your stuck PHP process:

ps aux | grep php

Attach strace to it:

strace -p 12345 -s 100

You might see it hanging on a connect() call to an external API that is down, or an lstat() loop on a massive directory. I once debugged a Magento install that was taking 10 seconds to load. Strace showed it was trying to stat a cache directory with 400,000 files in it. We moved the cache to Redis, and the load time dropped to 300ms.
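When the trace scrolls past too fast to read, counting calls by name usually exposes a loop like that one. The sketch below assumes a trace of a single process, where every line starts with the syscall name; syscall_counts is a hypothetical helper name.

```shell
# syscall_counts: read strace output on stdin and print how often each
# system call appears, most frequent first.
# (syscall_counts is a hypothetical helper name.)
syscall_counts() {
    awk -F'(' '
        # Only count lines that look like "name(...", skipping signal notes.
        /^[a-z_0-9]+\(/ { count[$1]++ }
        END { for (name in count) print count[name], name }
    ' | sort -rn
}
```

Usage: save the trace with strace -p 12345 -s 100 -o /tmp/trace.log, stop it after a few seconds, then run syscall_counts < /tmp/trace.log. strace also has a built-in summary mode (strace -c), which is worth trying first if all you need are totals.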

5. The Norwegian Context: Data & Latency

For those of us operating out of Oslo or serving Nordic clients, physical location matters. The Norwegian Personal Data Act (Personopplysningsloven) places strict requirements on how we handle data. While Safe Harbor exists for US transfers, Datatilsynet (the Norwegian Data Inspectorate) is increasingly vigilant about where data physically sits.

Hosting outside of the EEA introduces legal headaches you do not need. Furthermore, latency to the NIX in Oslo is a physical reality. Hosting in a cheap datacenter in Texas adds 120ms+ of round-trip time. For a dynamic application with many database calls, that latency compounds. If the queries cross that link and run one after another, a page needing 50 DB queries picks up 50 x 120ms = 6 seconds of load time from network round trips alone.
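The arithmetic is simple enough to script, which makes it easy to plug in your own query counts and measured ping times. This is a back-of-the-envelope sketch that assumes the queries run strictly one after another; latency_cost is a hypothetical helper name.

```shell
# latency_cost QUERIES RTT_MS: print the seconds spent on network round
# trips when QUERIES sequential queries each pay RTT_MS milliseconds.
# (latency_cost is a hypothetical helper name.)
latency_cost() {
    awk -v q="$1" -v rtt="$2" 'BEGIN { printf "%.1f\n", q * rtt / 1000 }'
}
```

Usage: latency_cost 50 120 gives the Texas figure above, while latency_cost 50 2 shows what the same page costs over a 2ms regional link.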

This is why CoolVDS infrastructure is located regionally. Low latency isn't just a "nice to have"; for database-heavy applications, it is a mathematical necessity.

Conclusion

Monitoring isn't about pretty graphs to put on a dashboard in the office; it is about knowing exactly what your server is doing at 3:00 AM. Stop guessing with restart scripts. Look at wa, check %st, and trace your processes.

And if you find that your debugging leads to the conclusion that your current host is stealing your CPU cycles or choking your I/O on old spinning disks, it might be time to move. Deploy a test instance on CoolVDS today—our SSD-backed KVM architecture is built for admins who know the difference.