Stop Guessing, Start Measuring: A SysAdmin's Guide to APM in 2014

"The site feels sluggish."

This is the ticket that ruins my weekend. It is vague. It is subjective. And usually, the client is testing on a 3G connection in a cabin somewhere in Finnmark while blaming your infrastructure. But sometimes, they are right. The server is up, load average is low, yet the Time To First Byte (TTFB) is crawling.

If you are still relying solely on Nagios to ping your server every 5 minutes, you are flying blind. In 2014, "uptime" is a vanity metric. Performance is the only metric that matters.

I have spent the last week debugging a Magento installation that was bringing a quad-core server to its knees. The culprit wasn't traffic; it was bad I/O and a single unindexed MySQL query. Here is how to spot the bottlenecks before your client does, using tools available right now on your Linux VPS.

1. The Foundation: Nginx Stub Status

Most of us have migrated from Apache to Nginx by now. If you haven't, you should. The event-driven architecture handles high concurrency far better than Apache's prefork MPM, which ties up a whole process for every connection. But are you watching it?

Nginx has a built-in module called ngx_http_stub_status_module. It is lightweight and gives you real-time data on active connections. On a CoolVDS instance (which usually ships with Nginx pre-compiled with this flag), you just need to enable it.
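Before touching the config, it is worth confirming the module is really compiled in. A minimal sketch, assuming your build lists --with-http_stub_status_module among its configure arguments (has_stub_status is only an illustrative helper name, not something Nginx ships):

```shell
# Hypothetical helper: reads `nginx -V` output on stdin and succeeds
# if the stub_status module appears in the configure arguments.
has_stub_status() {
    grep -q -- '--with-http_stub_status_module'
}

# Usage (nginx -V prints to stderr, hence the redirect):
#   nginx -V 2>&1 | has_stub_status && echo "stub_status is available"
```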

Add this to your nginx.conf inside a server block restricted to localhost or your VPN IP:

location /nginx_status {
    stub_status on;
    access_log off;
    allow 127.0.0.1;
    deny all;
}

Reload the service:

service nginx reload

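Now, a simple curl localhost/nginx_status tells you the truth. "Writing" counts connections where Nginx is still producing a response, which includes time spent waiting on the backend; "Waiting" counts idle keep-alive connections. If you see "Writing" spiking while "Waiting" drops, your backend (PHP-FPM) is choking, not the web server.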
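If you want those numbers in a form you can log or graph, the status page is easy to flatten. A rough sketch, assuming the standard four-line stub_status output; parse_stub_status is a made-up function name for illustration:

```shell
# Hypothetical helper: turns stub_status output (on stdin) into one
# key=value line that is easy to append to a log or feed to a grapher.
parse_stub_status() {
    awk '
        /^Active connections:/ { active = $3 }
        /^Reading:/            { reading = $2; writing = $4; waiting = $6 }
        END {
            printf "active=%d reading=%d writing=%d waiting=%d\n",
                   active, reading, writing, waiting
        }
    '
}

# Usage:
#   curl -s http://localhost/nginx_status | parse_stub_status
```

Run it from cron every minute and you have a poor man's time series of backend pressure.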

2. PHP-FPM Slow Logs: The Smoking Gun

If Nginx is waiting, PHP is thinking. Or crashing. Or sleeping.

You don't need expensive SaaS tools to find slow scripts. PHP-FPM has a built-in profiler that is criminally underused. Open your pool configuration (usually /etc/php5/fpm/pool.d/www.conf on Ubuntu 14.04) and uncomment these lines:

request_slowlog_timeout = 5s
slowlog = /var/log/php5-fpm.log.slow

Pro Tip: Set the timeout to 5 seconds initially. If you set it to 1s on a high-traffic site, you will fill your disk faster than you can say "kernel panic."

When a script exceeds this limit, PHP dumps the stack trace to the log. You will see exactly which function—usually a curl_exec to an external API or a heavy mysql_query—is holding up the thread. This is how I found out a client's WordPress site was timing out because it was trying to ping a dead XML-RPC server in Russia on every page load.
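Once the slow log has a few hundred entries, reading it by eye gets tedious. Here is a sketch that counts which call sits at the top of each trace. It assumes the usual layout (a script_filename line followed by frames that start with a [0x...] address), and slowlog_top_frames is a hypothetical name, not a PHP-FPM tool:

```shell
# Hypothetical helper: reads a PHP-FPM slow log on stdin and counts the
# innermost frame (the first [0x...] line after script_filename) of each
# trace, most frequent first.
slowlog_top_frames() {
    awk '
        /^script_filename = / { want = 1; next }
        want && /^\[0x/       { count[$2 " " $3]++; want = 0 }
        END { for (k in count) printf "%d %s\n", count[k], k }
    ' | sort -rn
}

# Usage:
#   slowlog_top_frames < /var/log/php5-fpm.log.slow | head
```

The function at the top of that list is the one to investigate first.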

3. The Database: Where Performance Dies

Nine times out of ten, in my experience, the performance problem is the database. Developers love ORMs because they are easy, but ORMs love to generate terrible SQL queries.

Enable the slow query log in MySQL. In your my.cnf:

[mysqld]
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log
long_query_time = 1
log_queries_not_using_indexes = 1

After 24 hours, run mysqldumpslow to aggregate the results. You will likely find that one query is responsible for 80% of your load.
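Before digging into individual queries, a quick sanity check on the raw log tells you how bad things are. The sketch below assumes the standard "# Query_time:" header line that MySQL writes for each entry; slowlog_totals is an illustrative name only:

```shell
# Hypothetical helper: reads a MySQL slow query log on stdin and prints
# how many entries it holds, their combined query time, and the worst one.
slowlog_totals() {
    awk '
        /^# Query_time:/ {
            n++
            total += $3
            if ($3 > max) max = $3
        }
        END { printf "queries=%d total_s=%.2f max_s=%.2f\n", n, total, max }
    '
}

# Usage:
#   slowlog_totals < /var/log/mysql/mysql-slow.log
# Then let mysqldumpslow group similar queries, sorted by query time (-s t),
# showing only the top ten (-t 10):
#   mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log
```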

The I/O Bottleneck

This is where your choice of hosting provider becomes critical. You can tune MySQL buffers all day, but if the underlying storage cannot handle the IOPS (Input/Output Operations Per Second), your database will lock up.

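Run iostat -x 1 (part of the sysstat package). Watch %iowait in the avg-cpu line (top reports the same figure as %wa), along with the per-device await and svctm columns. Treat svctm as a rough guide only; the sysstat man page itself warns that the field is unreliable.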

Metric                 Acceptable Range   Danger Zone
%wa (I/O Wait)         0% - 5%            > 20%
svctm (Service Time)   < 5ms              > 20ms

If your %wa is consistently over 10-15%, your CPU is sitting idle waiting for the disk to read data. This is common on budget VPS providers that oversell their spinning HDD arrays.
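If sysstat is not installed, the same figure can be derived by hand from two samples of the aggregate cpu line in /proc/stat. This is a sketch, assuming the field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal); iowait_pct is a hypothetical helper name:

```shell
# Hypothetical helper: reads two samples of the aggregate "cpu" line from
# /proc/stat on stdin and prints the share of time spent in iowait between
# them, as a percentage.
iowait_pct() {
    awk '
        $1 == "cpu" {
            total = 0
            # Fields 2..9 are user, nice, system, idle, iowait, irq,
            # softirq, steal. Guest time is already counted inside user.
            for (i = 2; i <= 9; i++) total += $i
            if (seen++ == 0) { t0 = total; w0 = $6 }
            else             { t1 = total; w1 = $6 }
        }
        END {
            if (t1 > t0) printf "%.1f\n", 100 * (w1 - w0) / (t1 - t0)
        }
    '
}

# Usage: sample twice, five seconds apart.
#   { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | iowait_pct
```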

This is why I deploy data-heavy applications on CoolVDS. They use pure SSD storage in RAID 10 configurations. In my benchmarks, the random read speeds on their KVM instances destroy standard cloud instances. When MySQL needs to scan a 2GB table, SSD latency (0.1ms) vs HDD latency (15ms) is the difference between an instant load and a timeout.

4. System Resources and "Steal" Time

Since we are talking about virtualization, we must talk about the "Noisy Neighbor" effect. If you use OpenVZ containers (common in cheap hosting), you are sharing the kernel. If another customer on the node gets DDoS'd, your performance suffers.

Run top and look at the %st (steal) value in the CPU row.

Cpu(s):  1.5%us,  0.5%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

If %st sits above 0 for more than a brief blip, the hypervisor is stealing CPU cycles from you to give to someone else. This is unacceptable for production environments.
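Nobody wants to stare at top all day, so it helps to pull the steal figure out for alerting. A small sketch, assuming a Cpu(s) line shaped like the one above (it also tolerates the newer "0.0 st" spelling); steal_from_top is an illustrative name:

```shell
# Hypothetical helper: reads the Cpu(s) summary line from top on stdin and
# prints only the steal percentage. Accepts "0.0%st" and "0.0 st".
steal_from_top() {
    sed -n 's/.*[ ,]\([0-9.][0-9.]*\)[ %]*st.*/\1/p'
}

# Usage (batch mode, one iteration):
#   top -bn1 | grep 'Cpu(s)' | steal_from_top
```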

We use CoolVDS because they utilize KVM (Kernel-based Virtual Machine) virtualization with dedicated resource allocation. 0% steal time is the standard there. You get the CPU cycles you pay for.

5. Data Privacy and Latency

We operate in Norway. Latency matters. Routing traffic through a datacenter in Frankfurt or Amsterdam adds 20-40ms to every round trip. For a dynamic application making 10 database calls and 5 API requests across that link, the latency compounds: 15 sequential round trips at 30ms each is 450ms spent doing nothing but waiting.

Hosting locally in Oslo reduces latency to the Norwegian Internet Exchange (NIX) to under 2ms. It makes the SSH terminal feel instantaneous.

Furthermore, we have Personopplysningsloven (the Personal Data Act) to consider. Datatilsynet, the Norwegian Data Protection Authority, is becoming increasingly strict about where customer data is stored. Keeping your data on Norwegian soil, protected by our strong privacy laws, is not just a technical advantage; it is a selling point to your CTO.

Conclusion

Performance monitoring isn't about staring at graphs; it's about actionable intelligence. Start with the logs you already have. Enable the slow logs. Watch your I/O wait.

And if you find that your code is optimized but your disk I/O is still the bottleneck, it is time to move. Don't let slow hardware kill your reputation.

Ready to eliminate I/O wait? Spin up a CoolVDS SSD instance today and see what 0% CPU steal looks like.